Submitted by:
| # | Name | Id | Email |
|---|------|----|-------|
| Student 1 | Gal Kesten | 316353176 | galkesten@campus.technion.ac.il |
| Student 2 | Chen Pery | 313283657 | chenpery@campus.technion.ac.il |
Introduction¶
In this assignment we'll create a from-scratch implementation of two fundamental deep learning concepts: the backpropagation algorithm and stochastic gradient descent-based optimizers. In addition, you will create a general-purpose multilayer perceptron, the core building block of deep neural networks. We'll visualize decision boundaries and ROC curves in the context of binary classification. Following that, we will focus on convolutional networks with residual blocks. We'll create our own network architectures and train them using GPUs on the course servers, then we'll conduct architecture experiments to determine the effects of different architectural decisions on the performance of deep networks.
General Guidelines¶
- Please read the getting started page on the course website. It explains how to set up, run and submit the assignment.
- Please read the course servers usage guide. It explains how to use and run your code on the course servers to benefit from training with GPUs.
- The text and code cells in these notebooks are intended to guide you through the assignment and help you verify your solutions. The notebooks do not need to be edited at all (unless you wish to play around). The only exception is to fill your name(s) in the above cell before submission. Please do not remove sections or change the order of any cells.
- All your code (and even answers to questions) should be written in the files within the Python package corresponding to the assignment number (hw1, hw2, etc.). You can of course use any editor or IDE to work on these files.
$$ \newcommand{\mat}[1]{\boldsymbol {#1}} \newcommand{\mattr}[1]{\boldsymbol {#1}^\top} \newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}} \newcommand{\vec}[1]{\boldsymbol {#1}} \newcommand{\vectr}[1]{\boldsymbol {#1}^\top} \newcommand{\rvar}[1]{\mathrm {#1}} \newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}} \newcommand{\diag}{\mathop{\mathrm {diag}}} \newcommand{\set}[1]{\mathbb {#1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\bb}[1]{\boldsymbol{#1}} $$
Part 1: Backpropagation¶
In this part, we'll implement backpropagation and automatic differentiation from scratch and compare our implementations to PyTorch's built in implementation (autograd).
import torch
import unittest
%load_ext autoreload
%autoreload 2
test = unittest.TestCase()
Reminder: The backpropagation algorithm is at the core of training deep models. To state the problem we'll tackle in this notebook, imagine we have an L-layer MLP model, defined as $$ \hat{\vec{y}^i} = \vec{y}_L^i= \varphi_L \left( \mat{W}_L \varphi_{L-1} \left( \cdots \varphi_1 \left( \mat{W}_1 \vec{x}^i + \vec{b}_1 \right) \cdots \right) + \vec{b}_L \right), $$
a pointwise loss function $\ell(\vec{y}, \hat{\vec{y}})$ and an empirical loss over our entire data set, $$ L(\vec{\theta}) = \frac{1}{N} \sum_{i=1}^{N} \ell(\vec{y}^i, \hat{\vec{y}^i}) + R(\vec{\theta}) $$
where $\vec{\theta}$ is a vector containing all network parameters, e.g. $\vec{\theta} = \left[ \mat{W}_{1,:}, \vec{b}_1, \dots, \mat{W}_{L,:}, \vec{b}_L \right]$.
In order to train our model we would like to calculate the derivative (or gradient, in the multivariate case) of the loss with respect to each and every one of the parameters, i.e. $\pderiv{L}{\mat{W}_j}$ and $\pderiv{L}{\vec{b}_j}$ for all $j$. Since the gradient "points" in the direction of functional increase, the negative gradient is often used as a descent direction for descent-based optimization algorithms. In other words, iteratively updating each parameter proportionally to its negative gradient can lead to convergence to a local minimum of the loss function.
Calculus tells us that as long as we know the derivatives of all the functions "along the way" ($\varphi_i(\cdot),\ \ell(\cdot,\cdot),\ R(\cdot)$) we can use the chain rule to calculate the derivative of the loss with respect to any one of the parameter vectors. Note that if the loss $L(\vec{\theta})$ is scalar (which is usually the case), the gradient of a parameter will have the same shape as the parameter itself (matrix/vector/tensor of same dimensions).
For deep models that are a composition of many functions, calculating the gradient of each parameter by hand and implementing hard-coded gradient derivations quickly becomes infeasible. Additionally, such code makes models hard to change, since any change potentially requires re-derivation and re-implementation of the entire gradient function.
The backpropagation algorithm, which we saw in the lecture, provides us with an effective method of applying the chain rule recursively so that we can implement gradient calculations of arbitrarily deep or complex models.
We'll now implement backpropagation using a modular approach, which will allow us to chain many component layers together and get automatic gradient calculation of the output with respect to the input or any intermediate parameter.
To do this, we'll define a Layer class. Here's the API of this class:
import hw2.layers as layers
help(layers.Layer)
Help on class Layer in module hw2.layers:
class Layer(abc.ABC)
| A Layer is some computation element in a network architecture which
| supports automatic differentiation using forward and backward functions.
|
| Method resolution order:
| Layer
| abc.ABC
| builtins.object
|
| Methods defined here:
|
| __call__(self, *args, **kwargs)
| Call self as a function.
|
| __init__(self)
| Initialize self. See help(type(self)) for accurate signature.
|
| __repr__(self)
| Return repr(self).
|
| backward(self, dout)
| Computes the backward pass of the layer, i.e. the gradient
| calculation of the final network output with respect to each of the
| parameters of the forward function.
| :param dout: The gradient of the network with respect to the
| output of this layer.
| :return: A tuple with the same number of elements as the parameters of
| the forward function. Each element will be the gradient of the
| network output with respect to that parameter.
|
| forward(self, *args, **kwargs)
| Computes the forward pass of the layer.
| :param args: The computation arguments (implementation specific).
| :return: The result of the computation.
|
| params(self)
| :return: Layer's trainable parameters and their gradients as a list
| of tuples, each tuple containing a tensor and it's corresponding
| gradient tensor.
|
| train(self, training_mode=True)
| Changes the mode of this layer between training and evaluation (test)
| mode. Some layers have different behaviour depending on mode.
| :param training_mode: True: set the model in training mode. False: set
| evaluation mode.
|
| ----------------------------------------------------------------------
| Data descriptors defined here:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
|
| ----------------------------------------------------------------------
| Data and other attributes defined here:
|
| __abstractmethods__ = frozenset({'backward', 'forward', 'params'})
In other words, a Layer can be anything: a layer, an activation function, a loss function or generally any computation that we know how to derive a gradient for.
Each Layer must define a forward() function and a backward() function.
- The forward() function performs the actual calculation/operation of the block and returns an output.
- The backward() function computes the gradient of the input and parameters as a function of the gradient of the output, according to the chain rule.
Here's a diagram illustrating the above explanation:

Note that the diagram doesn't show that if the function is parametrized, i.e. $f(\vec{x},\vec{y})=f(\vec{x},\vec{y};\vec{w})$, there are also gradients to calculate for the parameters $\vec{w}$.
The forward pass is straightforward: just do the computation. To understand the backward pass, imagine that there's some "downstream" loss function $L(\vec{\theta})$ and magically somehow we are told the gradient of that loss with respect to the output $\vec{z}$ of our block, i.e. $\pderiv{L}{\vec{z}}$.
Now, since we know how to calculate the derivative of $f(\vec{x},\vec{y};\vec{w})$, it means we know how to calculate $\pderiv{\vec{z}}{\vec{x}}$, $\pderiv{\vec{z}}{\vec{y}}$ and $\pderiv{\vec{z}}{\vec{w}}$ . Thanks to the chain rule, this is all we need to calculate the gradients of the loss w.r.t. the input and parameters:
$$ \begin{align} \pderiv{L}{\vec{x}} &= \pderiv{L}{\vec{z}}\cdot \pderiv{\vec{z}}{\vec{x}}\\ \pderiv{L}{\vec{y}} &= \pderiv{L}{\vec{z}}\cdot \pderiv{\vec{z}}{\vec{y}}\\ \pderiv{L}{\vec{w}} &= \pderiv{L}{\vec{z}}\cdot \pderiv{\vec{z}}{\vec{w}} \end{align} $$
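These chain-rule relations can be checked numerically on a tiny example, using PyTorch's autograd as the reference (toy scalars, with an arbitrary choice of downstream loss $L = z^2$):

```python
import torch

# Tiny example: z = w * (x + y), with a "downstream" scalar loss L = z^2.
# Check the chain-rule relations dL/dx = dL/dz * dz/dx, etc.
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(-1.0, requires_grad=True)
w = torch.tensor(3.0, requires_grad=True)

z = w * (x + y)
L = z ** 2
L.backward()

dL_dz = 2 * z.item()                                        # dL/dz for L = z^2
assert abs(x.grad.item() - dL_dz * w.item()) < 1e-6         # dz/dx = w
assert abs(y.grad.item() - dL_dz * w.item()) < 1e-6         # dz/dy = w
assert abs(w.grad.item() - dL_dz * (x + y).item()) < 1e-6   # dz/dw = x + y
```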
Comparison with PyTorch¶
PyTorch has the nn.Module base class, which may seem to be similar to our Layer since it also represents a computation element in a network.
However, PyTorch's nn.Module subclasses don't compute the gradient directly; they only define the forward calculations.
Instead, PyTorch has a more low-level API for defining a function and explicitly implementing its forward() and backward(). See autograd.Function.
When an operation is performed on a tensor, it creates a Function instance which performs the operation and
stores any necessary information for calculating the gradient later on. Additionally, Function objects point to the
other Function objects representing the operations performed earlier on the tensor. Thus, a graph (or DAG)
of operations is created (this is not 100% exact, as the graph is actually composed of a different type of class which wraps the backward method, but it's accurate enough for our purposes).
A Tensor instance which was created by performing operations on one or more tensors with requires_grad=True, has a grad_fn property which is a Function instance representing the last operation performed to produce this tensor.
This exposes the graph of Function instances, each with its own backward() function. Therefore, in PyTorch the backward() function is called on the tensors, not the modules.
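The autograd.Function interface can be illustrated with a tiny custom op; this is a minimal sketch (the Square op is our own example, not a built-in):

```python
import torch

class Square(torch.autograd.Function):
    """A custom op with an explicit forward() and backward(), mirroring the
    low-level Function interface described above."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)   # stash what backward() will need
        return x ** 2

    @staticmethod
    def backward(ctx, dout):
        (x,) = ctx.saved_tensors
        return dout * 2 * x        # VJP: dL/dx = dL/dz * dz/dx

x = torch.tensor([1.0, -2.0, 3.0], requires_grad=True)
z = Square.apply(x)
z.sum().backward()
print(x.grad)     # tensor([ 2., -4.,  6.])
print(z.grad_fn)  # the autograd graph node created by apply()
```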
Our Layers are therefore a combination of the ideas in Module and Function and we'll implement them together,
just to make things simpler.
Our goal here is to create a "poor man's autograd": We'll use PyTorch tensors,
but we'll calculate and store the gradients in our Layers (or return them).
The gradients we'll calculate are of the entire block, not individual operations on tensors.
To test our implementation, we'll use PyTorch's autograd.
Note that of course this method of tracking gradients is much more limited than what PyTorch offers. However it allows us to implement the backpropagation algorithm very simply and really see how it works.
Let's set up some testing instrumentation:
from hw2.grad_compare import compare_layer_to_torch
def test_block_grad(block: layers.Layer, x, y=None, delta=1e-3):
diffs = compare_layer_to_torch(block, x, y)
# Assert diff values
for diff in diffs:
test.assertLess(diff, delta)
# Show the compare function
compare_layer_to_torch??
Notes:
- After you complete your implementation, you should make sure to read and understand the compare_layer_to_torch() function. It will help you understand what PyTorch is doing.
- The value of delta above should not be needed. A correct implementation will give you a diff of exactly zero.
Layer Implementations¶
We'll now implement some Layers that will enable us to later build an MLP model of arbitrary depth, complete with automatic differentiation.
For each block, you'll first implement the forward() function.
Then, you will calculate the derivative of the block by hand with respect to each of its
input tensors and each of its parameter tensors (if any).
Using your manually-calculated derivation, you can then implement the backward() function.
Notice that we have intermediate Jacobians that are potentially high dimensional tensors. For example in the expression $\pderiv{L}{\vec{w}} = \pderiv{L}{\vec{z}}\cdot \pderiv{\vec{z}}{\vec{w}}$, the term $\pderiv{\vec{z}}{\vec{w}}$ is a 4D Jacobian if both $\vec{z}$ and $\vec{w}$ are 2D matrices.
In order to implement the backpropagation algorithm efficiently, we need to implement every backward function without explicitly constructing this Jacobian. Instead, we're interested in directly calculating the vector-Jacobian product (VJP) $\pderiv{L}{\vec{z}}\cdot \pderiv{\vec{z}}{\vec{w}}$. In order to do this, you should try to figure out the gradient of the loss with respect to one element, e.g. $\pderiv{L}{\vec{w}_{1,1}}$ and extrapolate from there how to directly obtain the VJP.
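To see concretely why the Jacobian never needs to be materialized, here is a small numeric sanity check (toy sizes, our own illustration): contracting the full 4D Jacobian of $\mat{Z} = \mat{X}\mattr{W}$ with $\pderiv{L}{\mat{Z}}$ gives the same result as the direct product $\pderiv{L}{\mat{Z}} \mat{W}$:

```python
import torch

# Sanity check: for Z = X @ W.T, contracting the full Jacobian dZ/dX with
# dL/dZ equals the direct VJP dL/dZ @ W (no Jacobian needed).
N, Din, Dout = 4, 3, 2
X = torch.randn(N, Din)
W = torch.randn(Dout, Din)
dout = torch.randn(N, Dout)          # pretend this is dL/dZ

# Full 4D Jacobian: dZ[i,j]/dX[t,l] = W[j,l] if t == i, else 0
J = torch.zeros(N, Dout, N, Din)
for i in range(N):
    J[i, :, i, :] = W

dX_full = torch.einsum('ij,ijtl->tl', dout, J)   # explicit contraction
dX_vjp = dout @ W                                # direct VJP
assert torch.allclose(dX_full, dX_vjp, atol=1e-5)
```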
Activation functions¶
(Leaky) ReLU¶
ReLU, or rectified linear unit, is a very common activation function in deep learning architectures. In its most standard form, as we'll implement here, it has no parameters.
We'll first implement the "leaky" version, defined as
$$ \mathrm{relu}(\vec{x}) = \max(\alpha\vec{x},\vec{x}), \ 0\leq\alpha<1 $$
This is similar to the ReLU activation we've seen in class, except that it has a small non-zero slope when its input is negative. Note that it's not strictly differentiable at zero; however, it has sub-gradients, defined separately for positive-valued and for negative-valued inputs.
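The sub-gradient logic can be sketched as plain functions (a hedged illustration assuming a simple stateless interface, not the required hw2.layers.LeakyReLU API): forward keeps the input, and backward scales dout elementwise by the sub-gradient, which is 1 where the input is positive and $\alpha$ elsewhere.

```python
import torch

def lrelu_forward(x, alpha=0.1):
    # max(alpha*x, x) picks x for positive inputs, alpha*x for negative ones
    return torch.max(alpha * x, x)

def lrelu_backward(x, dout, alpha=0.1):
    # Elementwise VJP: multiply dout by the sub-gradient of the activation
    grad = torch.where(x > 0, torch.ones_like(x), torch.full_like(x, alpha))
    return dout * grad

# Sanity check against PyTorch's own LeakyReLU gradient
x = torch.randn(5, 3, requires_grad=True)
torch.nn.LeakyReLU(0.1)(x).backward(torch.ones_like(x))
assert torch.allclose(lrelu_backward(x.detach(), torch.ones_like(x)), x.grad)
```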
TODO: Complete the implementation of the LeakyReLU class in the hw2/layers.py module.
N = 100
in_features = 200
num_classes = 10
eps = 1e-6
# Test LeakyReLU
alpha = 0.1
lrelu = layers.LeakyReLU(alpha=alpha)
x_test = torch.randn(N, in_features)
# Test forward pass
z = lrelu(x_test)
test.assertSequenceEqual(z.shape, x_test.shape)
test.assertTrue(torch.allclose(z, torch.nn.LeakyReLU(alpha)(x_test), atol=eps))
# Test backward pass
test_block_grad(lrelu, x_test)
Comparing gradients... input diff=0.000
Now using the LeakyReLU, we can trivially define a regular ReLU block as a special case.
TODO: Complete the implementation of the ReLU class in the hw2/layers.py module.
# Test ReLU
relu = layers.ReLU()
x_test = torch.randn(N, in_features)
# Test forward pass
z = relu(x_test)
test.assertSequenceEqual(z.shape, x_test.shape)
test.assertTrue(torch.allclose(z, torch.relu(x_test), atol=eps))
# Test backward pass
test_block_grad(relu, x_test)
Comparing gradients... input diff=0.000
Sigmoid¶
The sigmoid function $\sigma(x)$ is also sometimes used as an activation function. We have also seen it previously in the context of logistic regression.
The sigmoid function is defined as
$$ \sigma(\vec{x}) = \frac{1}{1+\exp(-\vec{x})}. $$
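A useful fact for the backward pass (a standard identity, stated here as a hint rather than part of the provided code): the sigmoid's derivative is $\sigma(x)(1-\sigma(x))$, so backward() can reuse the cached forward output instead of recomputing anything.

```python
import torch

# Verify sigma'(x) = s * (1 - s), where s is the forward output,
# against PyTorch's autograd.
x = torch.randn(4, 5, requires_grad=True)
s = torch.sigmoid(x)
s.backward(torch.ones_like(s))   # x.grad now holds sigma'(x)
assert torch.allclose(x.grad, (s * (1 - s)).detach(), atol=1e-6)
```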
# Test Sigmoid
sigmoid = layers.Sigmoid()
x_test = torch.randn(N, in_features, in_features) # 3D input should work
# Test forward pass
z = sigmoid(x_test)
test.assertSequenceEqual(z.shape, x_test.shape)
test.assertTrue(torch.allclose(z, torch.sigmoid(x_test), atol=eps))
# Test backward pass
test_block_grad(sigmoid, x_test)
Comparing gradients... input diff=0.000
Hyperbolic Tangent¶
The hyperbolic tangent function $\tanh(x)$ is a common activation function used when the output should be in the range [-1, 1].
The tanh function is defined as
$$ \tanh(\vec{x}) = \frac{\exp(\vec{x})-\exp(-\vec{x})}{\exp(\vec{x})+\exp(-\vec{x})}. $$
# Test TanH
tanh = layers.TanH()
x_test = torch.randn(N, in_features, in_features) # 3D input should work
# Test forward pass
z = tanh(x_test)
test.assertSequenceEqual(z.shape, x_test.shape)
test.assertTrue(torch.allclose(z, torch.tanh(x_test), atol=eps))
# Test backward pass
test_block_grad(tanh, x_test)
Comparing gradients... input diff=0.000
Linear (fully connected) layer¶
First, we'll implement an affine transform layer, also known as a fully connected layer.
Given an input $\mat{X}$ the layer computes,
$$ \mat{Z} = \mat{X} \mattr{W} + \vec{b} ,~ \mat{X}\in\set{R}^{N\times D_{\mathrm{in}}},~ \mat{W}\in\set{R}^{D_{\mathrm{out}}\times D_{\mathrm{in}}},~ \vec{b}\in\set{R}^{D_{\mathrm{out}}}. $$
Notes:
- We write it this way to follow the implementation conventions.
- $N$ is the number of samples in the input (batch size). The input $\mat{X}$ will always be a tensor containing a batch dimension first.
- Thanks to broadcasting, $\vec{b}$ can remain a vector even though the input $\mat{X}$ is a matrix.
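The shapes and the broadcasting of $\vec{b}$ can be checked directly (using the $N$, $D_{\mathrm{in}}$, $D_{\mathrm{out}}$ values from the test cell below):

```python
import torch

# Shapes and broadcasting for Z = X W^T + b
N, Din, Dout = 100, 200, 1000
X = torch.randn(N, Din)
W = torch.randn(Dout, Din)
b = torch.randn(Dout)

Z = X @ W.T + b                  # b is (Dout,) and broadcasts over the N rows
assert Z.shape == (N, Dout)
assert torch.allclose(Z[17], W @ X[17] + b, atol=1e-4)  # row i equals W x_i + b
```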
TODO: Complete the implementation of the Linear class in the hw2/layers.py module.
# Test Linear
out_features = 1000
fc = layers.Linear(in_features, out_features)
x_test = torch.randn(N, in_features)
# Test forward pass
z = fc(x_test)
test.assertSequenceEqual(z.shape, [N, out_features])
torch_fc = torch.nn.Linear(in_features, out_features,bias=True)
torch_fc.weight = torch.nn.Parameter(fc.w)
torch_fc.bias = torch.nn.Parameter(fc.b)
test.assertTrue(torch.allclose(torch_fc(x_test), z, atol=eps))
# Test backward pass
test_block_grad(fc, x_test)
# Test second backward pass
x_test = torch.randn(N, in_features)
z = fc(x_test)
z = fc(x_test)
test_block_grad(fc, x_test)
Comparing gradients... input diff=0.000 param#01 diff=0.000 param#02 diff=0.000 Comparing gradients... input diff=0.000 param#01 diff=0.000 param#02 diff=0.000
Cross-Entropy Loss¶
As you know by now, cross-entropy is a common loss function for classification tasks. In class, we defined it as
$$\ell_{\mathrm{CE}}(\vec{y},\hat{\vec{y}}) = - {\vectr{y}} \log(\hat{\vec{y}})$$
where $\hat{\vec{y}} = \mathrm{softmax}(\vec{x})$ is a probability vector (the output of softmax on the class scores $\vec{x}$) and the vector $\vec{y}$ is a 1-hot encoded class label.
However, it's tricky to compute the gradient of softmax, so instead we'll define a version of cross-entropy that produces the exact same output but works directly on the class scores $\vec{x}$.
We can write, $$\begin{align} \ell_{\mathrm{CE}}(\vec{y},\hat{\vec{y}}) &= - {\vectr{y}} \log(\hat{\vec{y}}) = - {\vectr{y}} \log\left(\mathrm{softmax}(\vec{x})\right) \\ &= - {\vectr{y}} \log\left(\frac{e^{\vec{x}}}{\sum_k e^{x_k}}\right) \\ &= - \log\left(\frac{e^{x_y}}{\sum_k e^{x_k}}\right) \\ &= - \left(\log\left(e^{x_y}\right) - \log\left(\sum_k e^{x_k}\right)\right)\\ &= - x_y + \log\left(\sum_k e^{x_k}\right) \end{align}$$
Where the scalar $y$ is the correct class label, so $x_y$ is the correct class score.
Note that this version of cross entropy is also what's provided by PyTorch's nn module.
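One practical caveat (an implementation hint, not part of the derivation above): the $\log\left(\sum_k e^{x_k}\right)$ term can overflow for large scores, and the standard fix is to subtract the per-row maximum inside the log-sum-exp. A sketch of the scores-based loss with this trick:

```python
import torch

def cross_entropy_scores(x, y):
    # -x_y + logsumexp(x), stabilized by subtracting the per-row max
    m = x.max(dim=1, keepdim=True).values
    lse = m.squeeze(1) + torch.log(torch.exp(x - m).sum(dim=1))
    return (lse - x[torch.arange(len(y)), y]).mean()

scores = torch.randn(8, 5) * 100    # large scores: a naive exp() would overflow
labels = torch.randint(0, 5, (8,))
expected = torch.nn.functional.cross_entropy(scores, labels)
assert torch.isclose(cross_entropy_scores(scores, labels), expected, atol=1e-3)
```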
TODO: Complete the implementation of the CrossEntropyLoss class in the hw2/layers.py module.
# Test CrossEntropy
cross_entropy = layers.CrossEntropyLoss()
scores = torch.randn(N, num_classes)
labels = torch.randint(low=0, high=num_classes, size=(N,), dtype=torch.long)
# Test forward pass
loss = cross_entropy(scores, labels)
expected_loss = torch.nn.functional.cross_entropy(scores, labels)
test.assertLess(torch.abs(expected_loss-loss).item(), 1e-5)
print('loss=', loss.item())
# Test backward pass
test_block_grad(cross_entropy, scores, y=labels)
loss= 2.7283618450164795 Comparing gradients... input diff=0.000
Building Models¶
Now that we have some working Layers, we can build an MLP model of arbitrary depth and compute end-to-end gradients.
First, let's copy an idea from PyTorch and implement our own version of the nn.Sequential Module.
This is a Layer which contains other Layers and calls them in sequence. We'll use this to build our MLP model.
TODO: Complete the implementation of the Sequential class in the hw2/layers.py module.
# Test Sequential
# Let's create a long sequence of layers and see
# whether we can compute end-to-end gradients of the whole thing.
seq = layers.Sequential(
layers.Linear(in_features, 100),
layers.Linear(100, 200),
layers.Linear(200, 100),
layers.ReLU(),
layers.Linear(100, 500),
layers.LeakyReLU(alpha=0.01),
layers.Linear(500, 200),
layers.ReLU(),
layers.Linear(200, 500),
layers.LeakyReLU(alpha=0.1),
layers.Linear(500, 1),
layers.Sigmoid(),
)
x_test = torch.randn(N, in_features)
# Test forward pass
z = seq(x_test)
test.assertSequenceEqual(z.shape, [N, 1])
# Test backward pass
test_block_grad(seq, x_test)
Comparing gradients... input diff=0.000 param#01 diff=0.000 param#02 diff=0.000 param#03 diff=0.000 param#04 diff=0.000 param#05 diff=0.000 param#06 diff=0.000 param#07 diff=0.000 param#08 diff=0.000 param#09 diff=0.000 param#10 diff=0.000 param#11 diff=0.000 param#12 diff=0.000 param#13 diff=0.000 param#14 diff=0.000
Now, equipped with a Sequential, all we have to do is create an MLP architecture.
We'll define our MLP with the following hyperparameters:
- Number of input features, $D$.
- Number of output classes, $C$.
- Sizes of hidden layers, $h_1,\dots,h_L$.
So the architecture will be:
FC($D$, $h_1$) $\rightarrow$ ReLU $\rightarrow$ FC($h_1$, $h_2$) $\rightarrow$ ReLU $\rightarrow$ $\cdots$ $\rightarrow$ FC($h_{L-1}$, $h_L$) $\rightarrow$ ReLU $\rightarrow$ FC($h_{L}$, $C$)
We'll also create a sequence of the above MLP and a cross-entropy loss, since it's the gradient of the loss that we need in order to train a model.
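The FC $\rightarrow$ ReLU $\rightarrow \cdots \rightarrow$ FC pattern above can be sketched as follows, shown with torch.nn modules for self-containment (make_mlp is a hypothetical helper, not the required hw2.layers.MLP implementation, which uses the custom layers instead):

```python
import torch
import torch.nn as nn

def make_mlp(din, dout, hidden):
    # Pair each hidden size with a Linear + ReLU, then a final Linear to dout
    dims = [din] + list(hidden)
    blocks = []
    for d_in, d_out in zip(dims, dims[1:]):
        blocks += [nn.Linear(d_in, d_out), nn.ReLU()]
    blocks.append(nn.Linear(dims[-1], dout))   # final FC, no activation
    return nn.Sequential(*blocks)

mlp = make_mlp(200, 10, [100, 50, 100])
assert len(mlp) == 7                           # 4 Linear + 3 ReLU, Linear last
assert mlp(torch.randn(64, 200)).shape == (64, 10)
```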
TODO: Complete the implementation of the MLP class in the hw2/layers.py module. Ignore the dropout parameter for now.
# Create an MLP model
mlp = layers.MLP(in_features, num_classes, hidden_features=[100, 50, 100])
print(mlp)
MLP, Sequential [0] Linear(self.in_features=200, self.out_features=100) [1] ReLU [2] Linear(self.in_features=100, self.out_features=50) [3] ReLU [4] Linear(self.in_features=50, self.out_features=100) [5] ReLU [6] Linear(self.in_features=100, self.out_features=10)
# Test MLP architecture
N = 100
in_features = 10
num_classes = 10
for activation in ('relu', 'sigmoid'):
mlp = layers.MLP(in_features, num_classes, hidden_features=[100, 50, 100], activation=activation)
test.assertEqual(len(mlp.sequence), 7)
num_linear = 0
for b1, b2 in zip(mlp.sequence, mlp.sequence[1:]):
if (str(b2).lower() == activation):
test.assertTrue(str(b1).startswith('Linear'))
num_linear += 1
test.assertTrue(str(mlp.sequence[-1]).startswith('Linear'))
test.assertEqual(num_linear, 3)
# Test MLP gradients
# Test forward pass
x_test = torch.randn(N, in_features)
print(x_test.shape)
labels = torch.randint(low=0, high=num_classes, size=(N,), dtype=torch.long)
z = mlp(x_test)
test.assertSequenceEqual(z.shape, [N, num_classes])
# Create a sequence of MLPs and CE loss
seq_mlp = layers.Sequential(mlp, layers.CrossEntropyLoss())
loss = seq_mlp(x_test, y=labels)
test.assertEqual(loss.dim(), 0)
print(f'MLP loss={loss}, activation={activation}')
# Test backward pass
test_block_grad(seq_mlp, x_test, y=labels)
torch.Size([100, 10]) MLP loss=2.30924391746521, activation=relu Comparing gradients... input diff=0.000 param#01 diff=0.000 param#02 diff=0.000 param#03 diff=0.000 param#04 diff=0.000 param#05 diff=0.000 param#06 diff=0.000 param#07 diff=0.000 param#08 diff=0.000 torch.Size([100, 10]) MLP loss=2.3934404850006104, activation=sigmoid Comparing gradients... input diff=0.000 param#01 diff=0.000 param#02 diff=0.000 param#03 diff=0.000 param#04 diff=0.000 param#05 diff=0.000 param#06 diff=0.000 param#07 diff=0.000 param#08 diff=0.000
If the above tests passed then congratulations - you've now implemented an arbitrarily deep model and loss function with end-to-end automatic differentiation!
Questions¶
TODO Answer the following questions. Write your answers in the appropriate variables in the module hw2/answers.py.
from cs236781.answers import display_answer
import hw2.answers
Question 1¶
Suppose we have a linear (i.e. fully-connected) layer with a weight tensor $\mat{W}$, defined with in_features=1024 and out_features=512. We apply this layer to an input tensor $\mat{X}$ containing a batch of N=64 samples. The output of the layer is denoted as $\mat{Y}$.
Consider the Jacobian tensor $\pderiv{\mat{Y}}{\mat{X}}$ of the output of the layer w.r.t. the input $\mat{X}$.
- What is the shape of this tensor?
- Is this Jacobian sparse (most elements zero by definition)? If so, why and which elements?
- Given the gradient of the output w.r.t. some downstream scalar loss $L$, $\delta\mat{Y}=\pderiv{L}{\mat{Y}}$, do we need to materialize the above Jacobian in order to calculate the downstream gradient w.r.t. the input ($\delta\mat{X}$)? If yes, explain why; if no, show how to calculate it without materializing the Jacobian.
Consider the Jacobian tensor $\pderiv{\mat{Y}}{\mat{W}}$ of the output of the layer w.r.t. the layer weights $\mat{W}$. Answer questions A-C about it as well.
display_answer(hw2.answers.part1_q1)
Your answer: 1.
A. Since the Linear layer function is $XW^\top+b=Y$ and the shape of $X$ is (64, 1024), we get that the shape of $W$ is (512, 1024) and the shape of the output $Y$ is (64, 512). The Jacobian tensor $\frac{\partial Y}{\partial X}$ captures how each element of $Y$ changes with respect to each element of $X$. Therefore the shape is (64, 512, 64, 1024).
B. We have that $Y_{ij}= \Sigma_{k=1}^{1024} X_{ik}W^\top_{kj} = \Sigma_{k=1}^{1024} X_{ik}W_{jk}$.
Thus, $$\frac{\partial Y_{ij}}{\partial{X_{tl}}} = \begin{cases} W_{j l} & \text{if } t=i \\ 0 & \text{else} \end{cases} $$ The Jacobian is a 4d Tensor such that $J[i, j] =\frac{\partial Y_{ij} }{\partial{X}}$ where $\frac{\partial Y_{ij}}{\partial{X}} \in M^{64, 1024}$. For each $\frac{\partial Y_{ij}}{\partial{X}}$ only the i-th row has non-zero elements, meaning only 1024 elements might be non-zero. This occurs for each of the 64*512 matrices (for each $Y_{ij}$).
Therefore, in every such matrix, only one row is non-zero, making the Jacobian tensor indeed sparse.
C. No, we can compute the partial gradient w.r.t L without calculating the jacobian tensor. $$ \frac{\partial L}{\partial{X}} = \Sigma_{i, j}{\frac{\partial L}{\partial{Y_{ij}}} \cdot \frac{\partial Y_{ij}}{\partial{X}} } = $$ $$ \Sigma_{i, j}{\frac{\partial L}{\partial{Y_{ij}}}} \begin{pmatrix} 0 & 0 & \cdots & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ W_{j1} & W_{j2} & \cdots & W_{jk} & \cdots & W_{j1024} \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 & \cdots & 0 \\ \end{pmatrix} = \Sigma_{i, j} \begin{pmatrix} 0 & 0 & \cdots & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ \frac{\partial L}{\partial{Y_{ij}}} W_{j1} & \frac{\partial L}{\partial{Y_{ij}}} W_{j2} & \cdots & \frac{\partial L}{\partial{Y_{ij}}} W_{jk} & \cdots & \frac{\partial L}{\partial{Y_{ij}}} W_{j1024} \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 & \cdots & 0 \\ \end{pmatrix} $$
$$ = \Sigma_{i=1}^{64} \Sigma_{j=1}^{512} \begin{pmatrix} 0 & 0 & \cdots & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ \frac{\partial L}{\partial{Y_{ij}}} W_{j1} & \frac{\partial L}{\partial{Y_{ij}}} W_{j2} & \cdots & \frac{\partial L}{\partial{Y_{ij}}} W_{jk} & \cdots & \frac{\partial L}{\partial{Y_{ij}}} W_{j1024} \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 & \cdots & 0 \\ \end{pmatrix} $$ $$ = \Sigma_{i=1}^{64} \begin{pmatrix} 0 & 0 & \cdots & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ \Sigma_{j=1}^{512}\frac{\partial L}{\partial{Y_{ij}}} W_{j1} & \Sigma_{j=1}^{512}\frac{\partial L}{\partial{Y_{ij}}} W_{j2} & \cdots & \Sigma_{j=1}^{512}\frac{\partial L}{\partial{Y_{ij}}} W_{jk} & \cdots & \Sigma_{j=1}^{512}\frac{\partial L}{\partial{Y_{ij}}} W_{j1024} \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 & \cdots & 0 \\ \end{pmatrix} $$ $$ = \begin{pmatrix} \Sigma_{j=1}^{512}\frac{\partial L}{\partial{Y_{1j}}} W_{j1} & \Sigma_{j=1}^{512}\frac{\partial L}{\partial{Y_{1j}}} W_{j2} & \cdots & \Sigma_{j=1}^{512}\frac{\partial L}{\partial{Y_{1j}}} W_{jk} & \cdots & \Sigma_{j=1}^{512} \frac{\partial L}{\partial{Y_{1j}}} W_{j1024} \\ \Sigma_{j=1}^{512}\frac{\partial L}{\partial{Y_{2j}}} W_{j1} & \Sigma_{j=1}^{512}\frac{\partial L}{\partial{Y_{2j}}} W_{j2} & \cdots & \Sigma_{j=1}^{512}\frac{\partial L}{\partial{Y_{2j}}} W_{jk} & \cdots & \Sigma_{j=1}^{512}\frac{\partial L}{\partial{Y_{2j}}} W_{j1024} \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ \Sigma_{j=1}^{512}\frac{\partial L}{\partial{Y_{ij}}} W_{j1} & \Sigma_{j=1}^{512}\frac{\partial L}{\partial{Y_{ij}}} W_{j2} & \cdots & \Sigma_{j=1}^{512}\frac{\partial L}{\partial{Y_{ij}}} W_{jk} & \cdots & \Sigma_{j=1}^{512} \frac{\partial L}{\partial{Y_{ij}}} W_{j1024} \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ 
\Sigma_{j=1}^{512}\frac{\partial L}{\partial{Y_{64j}}} W_{j1} & \Sigma_{j=1}^{512}\frac{\partial L}{\partial{Y_{64j}}} W_{j2} & \cdots & \Sigma_{j=1}^{512}\frac{\partial L}{\partial{Y_{64j}}} W_{jk} & \cdots &\Sigma_{j=1}^{512} \frac{\partial L}{\partial{Y_{64j}}} W_{j1024} \\ \end{pmatrix} = \frac{\partial L}{\partial{Y}} \cdot W =\delta\mat{Y}W $$
2.

A. Since the Linear layer function is $XW^\top+b=Y$ and the shape of $X$ is (64, 1024), we get that the shape of $W$ is (512, 1024) and the shape of the output $Y$ is (64, 512). The Jacobian tensor $\frac{\partial Y}{\partial W}$ captures how each element of $Y$ changes with respect to each element of $W$. Therefore the shape is (64, 512, 512, 1024).
B. We have that $Y_{ij}= \Sigma_{k=1}^{1024}X_{ik}W^T_{kj} = \Sigma_{k=1}^{1024}X_{ik}W_{jk}$.
Thus, $$\frac{\partial Y_{ij}}{\partial{W_{tl}}} = \begin{cases} X_{i l} & \text{if } t=j \\ 0 & \text{else} \end{cases} $$ The Jacobian is a 4d Tensor such that $J[i, j] =\frac{\partial Y_{ij} }{\partial{W}}$ where $\frac{\partial Y_{ij}}{\partial{W}} \in M^{512, 1024}$. For each $\frac{\partial Y_{ij}}{\partial{W}}$ only the j-th row has non-zero elements, meaning only 1024 elements might be non-zero. This occurs for each of the 64*512 matrices (for each $Y_{ij}$).
Therefore, in every such matrix, only one row is non-zero, making the Jacobian tensor indeed sparse.
C. No, we can compute the partial gradient w.r.t L without calculating the jacobian tensor. $$ \frac{\partial L}{\partial{W}} = \Sigma_{i, j}{\frac{\partial L}{\partial{Y_{ij}}} \cdot \frac{\partial Y_{ij}}{\partial{W}} } = $$ $$ \Sigma_{i, j}{\frac{\partial L}{\partial{Y_{ij}}}} \begin{pmatrix} 0 & 0 & \cdots & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ X_{i1} & X_{i2} & \cdots & X_{ik} & \cdots & X_{i1024} \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 & \cdots & 0 \\ \end{pmatrix} = \Sigma_{i, j} \begin{pmatrix} 0 & 0 & \cdots & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ \frac{\partial L}{\partial{Y_{ij}}} X_{i1} & \frac{\partial L}{\partial{Y_{ij}}} X_{i2} & \cdots & \frac{\partial L}{\partial{Y_{ij}}} X_{ik} & \cdots & \frac{\partial L}{\partial{Y_{ij}}} X_{i1024} \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 & \cdots & 0 \\ \end{pmatrix} $$
$$ = \Sigma_{j=1}^{512} \Sigma_{i=1}^{64} \begin{pmatrix} 0 & 0 & \cdots & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ \frac{\partial L}{\partial{Y_{ij}}} X_{i1} & \frac{\partial L}{\partial{Y_{ij}}} X_{i2} & \cdots & \frac{\partial L}{\partial{Y_{ij}}} X_{ik} & \cdots & \frac{\partial L}{\partial{Y_{ij}}} X_{i1024} \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 & \cdots & 0 \\ \end{pmatrix} $$ $$ = \Sigma_{j=1}^{512} \begin{pmatrix} 0 & 0 & \cdots & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ \Sigma_{i=1}^{64}\frac{\partial L}{\partial{Y_{ij}}} X_{i1} & \Sigma_{i=1}^{64}\frac{\partial L}{\partial{Y_{ij}}} X_{i2} & \cdots & \Sigma_{i=1}^{64}\frac{\partial L}{\partial{Y_{ij}}} X_{ik} & \cdots & \Sigma_{i=1}^{64}\frac{\partial L}{\partial{Y_{ij}}} X_{i1024} \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 & \cdots & 0 \\ \end{pmatrix} $$ $$ = \begin{pmatrix} \Sigma_{i=1}^{64}\frac{\partial L}{\partial{Y_{i1}}} X_{i1} & \Sigma_{i=1}^{64}\frac{\partial L}{\partial{Y_{i1}}} X_{i2} & \cdots & \Sigma_{i=1}^{64}\frac{\partial L}{\partial{Y_{i1}}} X_{ik} & \cdots & \Sigma_{i=1}^{64}\frac{\partial L}{\partial{Y_{i1}}} X_{i1024} \\ \Sigma_{i=1}^{64}\frac{\partial L}{\partial{Y_{i2}}} X_{i1} & \Sigma_{i=1}^{64}\frac{\partial L}{\partial{Y_{i2}}} X_{i2} & \cdots & \Sigma_{i=1}^{64}\frac{\partial L}{\partial{Y_{i2}}} X_{ik} & \cdots & \Sigma_{i=1}^{64}\frac{\partial L}{\partial{Y_{i2}}} X_{i1024} \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ \Sigma_{i=1}^{64}\frac{\partial L}{\partial{Y_{ij}}} X_{i1} & \Sigma_{i=1}^{64}\frac{\partial L}{\partial{Y_{ij}}} X_{i2} & \cdots & \Sigma_{i=1}^{64}\frac{\partial L}{\partial{Y_{ij}}} X_{ik} & \cdots & \Sigma_{i=1}^{64}\frac{\partial L}{\partial{Y_{ij}}} X_{i1024} \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ 
\Sigma_{i=1}^{64}\frac{\partial L}{\partial{Y_{i512}}} X_{i1} & \Sigma_{i=1}^{64}\frac{\partial L}{\partial{Y_{i512}}} X_{i2} & \cdots & \Sigma_{i=1}^{64}\frac{\partial L}{\partial{Y_{i512}}} X_{ik} & \cdots & \Sigma_{i=1}^{64}\frac{\partial L}{\partial{Y_{i512}}} X_{i1024} \\ \end{pmatrix} = {\frac{\partial L}{\partial{Y}}}^\top X = (\delta\mat{Y})^\top X $$
Question 2¶
Is back-propagation required in order to train neural networks with descent-based optimization? Why or why not?
display_answer(hw2.answers.part1_q2)
Your answer: It is theoretically possible to compute gradients without backpropagation. Alternative methods, such as finite differences and forward-mode automatic differentiation (AD), can also be used to train neural networks. However, these methods are generally far less efficient and practical than backpropagation, especially for large networks and when the number of outputs is small.
Therefore, while not absolutely required, backpropagation is the preferred method for training neural networks. It efficiently and accurately computes the gradients of the loss function with respect to all the weights in the network. By applying the chain rule of calculus, backpropagation propagates the error backward through the network, layer by layer, allowing for effective and scalable training of deep networks. Without backpropagation, calculating these gradients would be computationally infeasible and time-consuming for large networks.
(Finite differences is a numerical method to approximate the gradient of the loss function with respect to the weights. It involves perturbing each weight slightly and observing the change in the loss function.)
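To make the efficiency gap concrete, the snippet below compares a central finite-difference estimate against a backward pass on a toy loss (the `fd_grad` helper is illustrative, not part of the assignment code). Note that finite differences needs two loss evaluations *per parameter*, while backpropagation produces every gradient in a single backward pass.

```python
import torch

def fd_grad(f, w, eps=1e-4):
    # Central finite differences: perturb each parameter in turn and
    # re-evaluate the loss -- two evaluations of f per parameter.
    g = torch.zeros_like(w)
    flat, gflat = w.view(-1), g.view(-1)
    for i in range(flat.numel()):
        orig = flat[i].item()
        flat[i] = orig + eps
        hi = f(w)
        flat[i] = orig - eps
        lo = f(w)
        flat[i] = orig
        gflat[i] = (hi - lo) / (2 * eps)
    return g

w = torch.randn(5, 3, dtype=torch.float64, requires_grad=True)
loss_fn = lambda t: (t ** 2).sum()               # toy loss; exact gradient is 2w
loss_fn(w).backward()                            # one backward pass: all 15 grads
g_approx = fd_grad(loss_fn, w.detach().clone())  # 30 separate loss evaluations
print(torch.allclose(g_approx, w.grad, atol=1e-6))  # True
```

The agreement is what makes finite differences useful as a *sanity check* for a backpropagation implementation, even though it is far too slow to train with.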
$$ \newcommand{\mat}[1]{\boldsymbol {#1}} \newcommand{\mattr}[1]{\boldsymbol {#1}^\top} \newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}} \newcommand{\vec}[1]{\boldsymbol {#1}} \newcommand{\vectr}[1]{\boldsymbol {#1}^\top} \newcommand{\rvar}[1]{\mathrm {#1}} \newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}} \newcommand{\diag}{\mathop{\mathrm {diag}}} \newcommand{\set}[1]{\mathbb {#1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\bb}[1]{\boldsymbol{#1}} $$
Part 2: Optimization and Training¶
In this part we will learn how to implement optimization algorithms for deep networks. Additionally, we'll learn how to write training loops and implement a modular model trainer. We'll use our optimizers and training code to test a few configurations for classifying images with an MLP model.
import os
import numpy as np
import matplotlib.pyplot as plt
import unittest
import torch
import torchvision
import torchvision.transforms as tvtf
%matplotlib inline
%load_ext autoreload
%autoreload 2
seed = 42
plt.rcParams.update({'font.size': 12})
test = unittest.TestCase()
Implementing Optimization Algorithms¶
In the context of deep learning, an optimization algorithm is some method of iteratively updating model parameters so that the loss converges toward some local minimum (which we hope will be good enough).
Gradient descent-based methods are by far the most popular algorithms for optimization of neural network parameters. However the high-dimensional loss-surfaces we encounter in deep learning applications are highly non-convex. They may be riddled with local minima, saddle points, large plateaus and a host of very challenging "terrain" for gradient-based optimization. This gave rise to many different methods of performing the parameter updates based on the loss gradients, aiming to tackle these optimization challenges.
The most basic gradient-based update rule can be written as,
$$ \vec{\theta} \leftarrow \vec{\theta} - \eta \nabla_{\vec{\theta}} L(\vec{\theta}; \mathcal{D}) $$
where $\mathcal{D} = \left\{ (\vec{x}^i, \vec{y}^i) \right\}_{i=1}^{M}$ is our training dataset or part of it. Specifically, if we have in total $N$ training samples, then
- If $M=N$ this is known as regular (batch) gradient descent. For large datasets, computing this gradient at every step is very expensive, and the full batch may not even fit in memory.
- If $M=1$, the loss is computed w.r.t. a single different sample each time. This is known as stochastic gradient descent.
- If $1<M<N$ this is known as stochastic mini-batch gradient descent. This is the most commonly-used option.
The intuition behind gradient descent is simple: since the gradient of a multivariate function points in the direction of steepest ascent ("uphill"), we move in the opposite direction. A small step size $\eta$, known as the learning rate, is required since the gradient only serves as a first-order linear approximation of the function's behaviour at $\vec{\theta}$ (recall e.g. the Taylor expansion). In truth, our loss surface generally has nontrivial curvature caused by a high-order nonlinear dependence on $\vec{\theta}$, so taking a large step in the direction of the gradient may actually increase the function value.

The idea behind the stochastic versions is that by constantly changing the samples we compute the loss with, we get a dynamic error surface, i.e. one that is different for each set of training samples. This is thought to generally improve the optimization, since it may help the optimizer escape flat regions or sharp local minima: these features may disappear in the loss surface of subsequent batches. The image below illustrates this. The different lines are different 1-dimensional losses for different training-set samples.

Deep learning frameworks generally provide implementations of various gradient-based optimization algorithms.
Here we'll implement our own optimization module from scratch, this time keeping a similar API to the PyTorch optim package.
We define a base Optimizer class. An optimizer holds a set of parameter tensors (these are the trainable parameters of some model) and maintains internal state. It may be used as follows:
- After the forward pass has been performed, the optimizer's zero_grad() function is invoked to clear the parameter gradients computed by previous iterations.
- After the backward pass has been performed, and gradients have been calculated for these parameters, the optimizer's step() function is invoked in order to update the value of each parameter based on its gradient.
The exact method of update is implementation-specific for each optimizer and may depend on its internal state. In addition, adding the regularization penalty to the gradient is handled by the optimizer since it only depends on the parameter values (and not the data).
Here's the API of our Optimizer:
import hw2.optimizers as optimizers
help(optimizers.Optimizer)
Help on class Optimizer in module hw2.optimizers:
class Optimizer(abc.ABC)
| Optimizer(params)
|
| Base class for optimizers.
|
| Method resolution order:
| Optimizer
| abc.ABC
| builtins.object
|
| Methods defined here:
|
| __init__(self, params)
| :param params: A sequence of model parameters to optimize. Can be a
| list of (param,grad) tuples as returned by the Layers, or a list of
| pytorch tensors in which case the grad will be taken from them.
|
| step(self)
| Updates all the registered parameter values based on their gradients.
|
| zero_grad(self)
| Sets the gradient of the optimized parameters to zero (in place).
|
| ----------------------------------------------------------------------
| Readonly properties defined here:
|
| params
| :return: A sequence of parameter tuples, each tuple containing
| (param_data, param_grad). The data should be updated in-place
| according to the grad.
|
| ----------------------------------------------------------------------
| Data descriptors defined here:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
|
| ----------------------------------------------------------------------
| Data and other attributes defined here:
|
| __abstractmethods__ = frozenset({'step'})
Vanilla SGD with Regularization¶
Let's start by implementing the simplest gradient-based optimizer. The update rule will be exactly as stated above, but we'll also add an L2-regularization term to the gradient. Remember that in the loss function, the L2 regularization term is expressed by
$$R(\vec{\theta}) = \frac{1}{2}\lambda||\vec{\theta}||^2_2.$$
TODO: Complete the implementation of the VanillaSGD class in the hw2/optimizers.py module.
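Before filling in the graded class, it may help to see the update in isolation. Below is a minimal sketch (SketchVanillaSGD is a hypothetical stand-in, assuming the (param, grad) tuple API from the Optimizer docstring above): the penalty $\frac{1}{2}\lambda||\vec{\theta}||^2_2$ contributes $\lambda\vec{\theta}$ to the gradient, which is added before the usual step.

```python
import torch

class SketchVanillaSGD:
    # Illustrative only -- not the graded VanillaSGD implementation.
    def __init__(self, params, learn_rate, reg):
        self.params, self.lr, self.reg = params, learn_rate, reg

    def step(self):
        for p, dp in self.params:
            # d/dp of 0.5 * reg * ||p||^2 is reg * p; fold it into the grad,
            # then take the plain gradient-descent step.
            p -= self.lr * (dp + self.reg * p)

p = torch.ones(3)
opt = SketchVanillaSGD([(p, torch.full((3,), 0.5))], learn_rate=0.1, reg=0.1)
opt.step()
print(p)  # each entry: 1 - 0.1 * (0.5 + 0.1 * 1) = 0.94
```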
# Test VanillaSGD
torch.manual_seed(42)
p = torch.randn(500, 10)
dp = torch.randn(*p.shape)*2
params = [(p, dp)]
vsgd = optimizers.VanillaSGD(params, learn_rate=0.5, reg=0.1)
vsgd.step()
expected_p = torch.load('tests/assets/expected_vsgd.pt')
diff = torch.norm(p-expected_p).item()
print(f'diff={diff}')
test.assertLess(diff, 1e-3)
diff=1.0932822078757454e-06
Training¶
Now that we can build a model and loss function, compute their gradients and we have an optimizer, we can finally do some training!
In the spirit of more modular software design, we'll implement a class that will aid us in automating the repetitive training loop code that we usually write over and over again. This will be useful for both training our Layer-based models and also later for training PyTorch nn.Modules.
Here's our Trainer API:
import hw2.training as training
help(training.Trainer)
Help on class Trainer in module hw2.training:
class Trainer(abc.ABC)
| Trainer(model: torch.nn.modules.module.Module, device: Union[torch.device, NoneType] = None)
|
| A class abstracting the various tasks of training models.
|
| Provides methods at multiple levels of granularity:
| - Multiple epochs (fit)
| - Single epoch (train_epoch/test_epoch)
| - Single batch (train_batch/test_batch)
|
| Method resolution order:
| Trainer
| abc.ABC
| builtins.object
|
| Methods defined here:
|
| __init__(self, model: torch.nn.modules.module.Module, device: Union[torch.device, NoneType] = None)
| Initialize the trainer.
| :param model: Instance of the model to train.
| :param device: torch.device to run training on (CPU or GPU).
|
| fit(self, dl_train: torch.utils.data.dataloader.DataLoader, dl_test: torch.utils.data.dataloader.DataLoader, num_epochs: int, checkpoints: str = None, early_stopping: int = None, print_every: int = 1, **kw) -> cs236781.train_results.FitResult
| Trains the model for multiple epochs with a given training set,
| and calculates validation loss over a given validation set.
| :param dl_train: Dataloader for the training set.
| :param dl_test: Dataloader for the test set.
| :param num_epochs: Number of epochs to train for.
| :param checkpoints: Whether to save model to file every time the
| test set accuracy improves. Should be a string containing a
| filename without extension.
| :param early_stopping: Whether to stop training early if there is no
| test loss improvement for this number of epochs.
| :param print_every: Print progress every this number of epochs.
| :return: A FitResult object containing train and test losses per epoch.
|
| save_checkpoint(self, checkpoint_filename: str)
| Saves the model in it's current state to a file with the given name (treated
| as a relative path).
| :param checkpoint_filename: File name or relative path to save to.
|
| test_batch(self, batch) -> cs236781.train_results.BatchResult
| Runs a single batch forward through the model and calculates loss.
| :param batch: A single batch of data from a data loader (might
| be a tuple of data and labels or anything else depending on
| the underlying dataset.
| :return: A BatchResult containing the value of the loss function and
| the number of correctly classified samples in the batch.
|
| test_epoch(self, dl_test: torch.utils.data.dataloader.DataLoader, **kw) -> cs236781.train_results.EpochResult
| Evaluate model once over a test set (single epoch).
| :param dl_test: DataLoader for the test set.
| :param kw: Keyword args supported by _foreach_batch.
| :return: An EpochResult for the epoch.
|
| train_batch(self, batch) -> cs236781.train_results.BatchResult
| Runs a single batch forward through the model, calculates loss,
| preforms back-propagation and updates weights.
| :param batch: A single batch of data from a data loader (might
| be a tuple of data and labels or anything else depending on
| the underlying dataset.
| :return: A BatchResult containing the value of the loss function and
| the number of correctly classified samples in the batch.
|
| train_epoch(self, dl_train: torch.utils.data.dataloader.DataLoader, **kw) -> cs236781.train_results.EpochResult
| Train once over a training set (single epoch).
| :param dl_train: DataLoader for the training set.
| :param kw: Keyword args supported by _foreach_batch.
| :return: An EpochResult for the epoch.
|
| ----------------------------------------------------------------------
| Data descriptors defined here:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
|
| ----------------------------------------------------------------------
| Data and other attributes defined here:
|
| __abstractmethods__ = frozenset({'test_batch', 'train_batch'})
The Trainer class splits the task of training (and evaluating) models into three conceptual levels,
- Multiple epochs - the fit method, which returns a FitResult containing losses and accuracies for all epochs.
- Single epoch - the train_epoch and test_epoch methods, which return an EpochResult containing losses per batch and the single accuracy result of the epoch.
- Single batch - the train_batch and test_batch methods, which return a BatchResult containing a single loss and the number of correctly classified samples in the batch.
It implements the first two levels. Inheriting classes are expected to implement the single-batch level methods since these are model and/or task specific.
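One piece of fit() that can be sketched independently of any model is the early-stopping logic (train_until_plateau is a hypothetical helper, not part of the Trainer API): training stops once the test loss has failed to improve for a given number of consecutive epochs.

```python
def train_until_plateau(test_losses, patience):
    """Return the epoch index at which training would stop."""
    best, epochs_without_improvement = float("inf"), 0
    for epoch, loss in enumerate(test_losses):
        if loss < best:
            # New best test loss: reset the patience counter.
            best, epochs_without_improvement = loss, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch  # patience exhausted: stop here
    return len(test_losses) - 1  # ran all epochs without triggering the stop

# Loss improves twice, then fails to improve for 3 epochs -> stop at index 4.
print(train_until_plateau([1.0, 0.8, 0.9, 0.85, 0.95, 0.7], patience=3))  # 4
```

Note that the counter resets whenever a new best loss appears, so noisy losses that keep dipping below the best will keep training alive.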
The first thing we should do in order to verify our model, gradient calculations and optimizer implementation is to try to overfit a large model (many parameters) to a small dataset (few images). This will show us that things are working properly.
Let's begin by loading the CIFAR-10 dataset.
data_dir = os.path.expanduser('~/.pytorch-datasets')
ds_train = torchvision.datasets.CIFAR10(root=data_dir, download=True, train=True, transform=tvtf.ToTensor())
ds_test = torchvision.datasets.CIFAR10(root=data_dir, download=True, train=False, transform=tvtf.ToTensor())
print(f'Train: {len(ds_train)} samples')
print(f'Test: {len(ds_test)} samples')
Files already downloaded and verified
Files already downloaded and verified
Train: 50000 samples Test: 10000 samples
Now, let's implement just a small part of our training logic since that's what we need right now.
TODO:
- Complete the implementation of the train_batch() method in the LayerTrainer class within the hw2/training.py module.
- Update the hyperparameter values in the part2_overfit_hp() function in the hw2/answers.py module. Tweak the hyperparameter values until your model overfits a small number of samples in the code block below. You should get 100% accuracy within a few epochs.
The following code block will use your custom Layer-based MLP implementation, custom vanilla SGD optimizer and custom trainer to overfit the data. The classification accuracy should be 100% within a few epochs.
import hw2.layers as layers
import hw2.answers as answers
from torch.utils.data import DataLoader
# Overfit to a very small dataset of 20 samples
batch_size = 10
max_batches = 2
dl_train = torch.utils.data.DataLoader(ds_train, batch_size, shuffle=False)
# Get hyperparameters
hp = answers.part2_overfit_hp()
torch.manual_seed(seed)
# Build a model and loss using our custom MLP and CE implementations
model = layers.MLP(3*32*32, num_classes=10, hidden_features=[128]*3, wstd=hp['wstd'])
loss_fn = layers.CrossEntropyLoss()
# Use our custom optimizer
optimizer = optimizers.VanillaSGD(model.params(), learn_rate=hp['lr'], reg=hp['reg'])
# Run training over small dataset multiple times
trainer = training.LayerTrainer(model, loss_fn, optimizer)
best_acc = 0
for i in range(20):
res = trainer.train_epoch(dl_train, max_batches=max_batches)
best_acc = res.accuracy if res.accuracy > best_acc else best_acc
test.assertGreaterEqual(best_acc, 98)
Now that we know training works, let's try to fit a model to a bit more data for a few epochs, to see how well we're doing. First, we need a function to plot the FitResults object.
from cs236781.plot import plot_fit
plot_fit?
TODO:
- Complete the implementation of the test_batch() method in the LayerTrainer class within the hw2/training.py module.
- Implement the fit() method of the Trainer class within the hw2/training.py module.
- Tweak the hyperparameters for this section in the part2_optim_hp() function in the hw2/answers.py module.
- Run the following code blocks to train. Try to get above 35-40% test-set accuracy.
# Define a larger part of the CIFAR-10 dataset (still not the whole thing)
batch_size = 50
max_batches = 100
in_features = 3*32*32
num_classes = 10
dl_train = torch.utils.data.DataLoader(ds_train, batch_size, shuffle=False)
dl_test = torch.utils.data.DataLoader(ds_test, batch_size//2, shuffle=False)
# Define a function to train a model with our Trainer and various optimizers
def train_with_optimizer(opt_name, opt_class, fig):
torch.manual_seed(seed)
# Get hyperparameters
hp = answers.part2_optim_hp()
hidden_features = [128] * 5
num_epochs = 10
# Create model, loss and optimizer instances
model = layers.MLP(in_features, num_classes, hidden_features, wstd=hp['wstd'])
loss_fn = layers.CrossEntropyLoss()
optimizer = opt_class(model.params(), learn_rate=hp[f'lr_{opt_name}'], reg=hp['reg'])
# Train with the Trainer
trainer = training.LayerTrainer(model, loss_fn, optimizer)
fit_res = trainer.fit(dl_train, dl_test, num_epochs, max_batches=max_batches)
fig, axes = plot_fit(fit_res, fig=fig, legend=opt_name)
return fig
fig_optim = None
fig_optim = train_with_optimizer('vanilla', optimizers.VanillaSGD, fig_optim)
--- EPOCH 1/10 ---
--- EPOCH 2/10 ---
--- EPOCH 3/10 ---
--- EPOCH 4/10 ---
--- EPOCH 5/10 ---
--- EPOCH 6/10 ---
--- EPOCH 7/10 ---
--- EPOCH 8/10 ---
--- EPOCH 9/10 ---
--- EPOCH 10/10 ---
Momentum¶
The simple vanilla SGD update is rarely used in practice since it's very slow to converge relative to other optimization algorithms.
One reason is that naïvely updating in the direction of the current gradient causes it to fluctuate wildly in areas where the loss surface is much steeper in some dimensions than in others. Another reason is that using the same learning rate for all parameters is not a great idea, since not all parameters are created equal. For example, parameters associated with rare features should be updated with a larger step than ones associated with commonly-occurring features, because they'll receive fewer updates through the gradients.
Therefore more advanced optimizers take into account the previous gradients of a parameter and/or try to use a per-parameter specific learning rate instead of a common one.
Let's now implement a simple and common optimizer: SGD with Momentum. This optimizer takes previous gradients of a parameter into account when updating its value, instead of just the current one. In practice it usually provides faster convergence than vanilla SGD.
The SGD with Momentum update rule can be stated as follows: $$\begin{align} \vec{v}_{t+1} &= \mu \vec{v}_t - \eta \delta \vec{\theta}_t \\ \vec{\theta}_{t+1} &= \vec{\theta}_t + \vec{v}_{t+1} \end{align}$$
Where $\eta$ is the learning rate, $\vec{\theta}$ is a model parameter, $\delta \vec{\theta}_t=\pderiv{L}{\vec{\theta}}(\vec{\theta}_t)$ is the gradient of the loss w.r.t. to the parameter and $0\leq\mu<1$ is a hyperparameter known as momentum.
Expanding the update rule recursively shows how the parameter update in fact depends on all previous gradient values for that parameter, where the old gradients are exponentially decayed by a factor of $\mu$ at each timestep.
Since we're incorporating previous gradient (update directions), a noisy value of the current gradient will have less effect so that the general direction of previous updates is maintained somewhat. The following figure illustrates this.

TODO:
- Complete the implementation of the MomentumSGD class in the hw2/optimizers.py module.
- Tweak the learning rate for momentum in the part2_optim_hp() function in the hw2/answers.py module.
- Run the following code block to compare to the vanilla SGD.
fig_optim = train_with_optimizer('momentum', optimizers.MomentumSGD, fig_optim)
fig_optim
--- EPOCH 1/10 ---
--- EPOCH 2/10 ---
--- EPOCH 3/10 ---
--- EPOCH 4/10 ---
--- EPOCH 5/10 ---
--- EPOCH 6/10 ---
--- EPOCH 7/10 ---
--- EPOCH 8/10 ---
--- EPOCH 9/10 ---
--- EPOCH 10/10 ---
Bonus: RMSProp¶
This is another optimizer that accounts for previous gradients, but this time it uses them to adapt the learning rate per parameter.
RMSProp maintains a decaying moving average of previous squared gradients, $$ \vec{r}_{t+1} = \gamma\vec{r}_{t} + (1-\gamma)\delta\vec{\theta}_t^2 $$ where $0<\gamma<1$ is a decay constant usually set close to $1$, and $\delta\vec{\theta}_t^2$ denotes element-wise squaring.
The update rule for each parameter is then, $$ \vec{\theta}_{t+1} = \vec{\theta}_t - \left( \frac{\eta}{\sqrt{\vec{r}_{t+1}+\varepsilon}} \right) \delta\vec{\theta}_t $$
where $\varepsilon$ is a small constant to prevent numerical instability. The idea here is to decrease the learning rate for parameters with high gradient values and vice-versa. The decaying moving average prevents accumulating all the past gradients which would cause the effective learning rate to become zero.
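Both update equations can be sketched in a few lines (SketchRMSProp is illustrative, not the graded RMSProp class, and again assumes (param, grad) tuples):

```python
import torch

class SketchRMSProp:
    # Illustrative only -- not the graded RMSProp implementation.
    def __init__(self, params, learn_rate, decay=0.99, eps=1e-8):
        self.params, self.lr = params, learn_rate
        self.gamma, self.eps = decay, eps
        # Decaying average of squared gradients, one buffer per parameter.
        self.r = [torch.zeros_like(p) for p, _ in params]

    def step(self):
        for r, (p, dp) in zip(self.r, self.params):
            r.mul_(self.gamma).add_((1 - self.gamma) * dp ** 2)
            # Per-element effective learning rate: lr / sqrt(r + eps).
            p -= self.lr / torch.sqrt(r + self.eps) * dp

p, dp = torch.zeros(1), torch.tensor([2.0])
opt = SketchRMSProp([(p, dp)], learn_rate=0.1)
opt.step()
print(p)  # r = 0.01 * 4 = 0.04; step = 0.1 / sqrt(0.04) * 2, so p is about -1.0
```

Notice how the large squared gradient inflates $\vec{r}$ and thereby shrinks the effective learning rate on exactly the elements with consistently large gradients.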
Bonus:
- Complete the implementation of the RMSProp class in the hw2/optimizers.py module.
- Tweak the learning rate for RMSProp in the part2_optim_hp() function in the hw2/answers.py module.
- Run the following code block to compare to the other optimizers.
fig_optim = train_with_optimizer('rmsprop', optimizers.RMSProp, fig_optim)
fig_optim
--- EPOCH 1/10 ---
--- EPOCH 2/10 ---
--- EPOCH 3/10 ---
--- EPOCH 4/10 ---
--- EPOCH 5/10 ---
--- EPOCH 6/10 ---
--- EPOCH 7/10 ---
--- EPOCH 8/10 ---
--- EPOCH 9/10 ---
--- EPOCH 10/10 ---
Note that you should get better train/test accuracy with Momentum and RMSProp than with vanilla SGD.
Dropout Regularization¶
Dropout is a useful technique to improve generalization of deep models.
The idea is simple: during the forward pass, drop (i.e. set to zero) the activation of each neuron with probability $p$. For example, if $p=0.4$ this means we drop the activations of 40% of the neurons (on average).
There are a few important things to note about dropout:
- It is only performed during training. When testing our model the dropout layers should be a no-op.
- In the backward pass, gradients are only propagated back into neurons that weren't dropped during the forward pass.
- During testing, the activations must be scaled, since the expected value of each neuron's activation during training is $1-p$ times its original expectation. Thus, we need to scale the test-time activations by $1-p$ to match. Equivalently, we can scale the train-time activations by $1/(1-p)$.
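These points can be captured in a short sketch of the forward pass (the dropout_forward helper is illustrative; the graded Dropout layer also needs to keep the mask for the backward pass, since gradients flow only through surviving neurons):

```python
import torch

def dropout_forward(x, p=0.5, training=True):
    # Test time (and p = 0): dropout is a no-op.
    if not training or p == 0:
        return x
    # Drop each activation with probability p. Scaling the survivors by
    # 1/(1-p) ("inverted dropout") keeps the expected activation unchanged,
    # so no extra scaling is needed at test time.
    mask = (torch.rand_like(x) >= p).float()
    return x * mask / (1 - p)

x = torch.ones(100_000)
y = dropout_forward(x, p=0.4)
print((y == 0).float().mean())  # about 0.4 of the activations are dropped
print(y.mean())                 # close to 1: the expectation is preserved
```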
TODO:
- Complete the implementation of the Dropout class in the hw2/layers.py module.
- Finish the implementation of the MLP's __init__() method in the hw2/layers.py module. If dropout > 0 you should add a Dropout layer after each ReLU.
from hw2.grad_compare import compare_layer_to_torch
# Check architecture of MLP with dropout layers
mlp_dropout = layers.MLP(in_features, num_classes, [50]*3, dropout=0.6)
print(mlp_dropout)
test.assertEqual(len(mlp_dropout.sequence), 10)
for b1, b2 in zip(mlp_dropout.sequence, mlp_dropout.sequence[1:]):
if str(b1).lower() == 'relu':
test.assertTrue(str(b2).startswith('Dropout'))
test.assertTrue(str(mlp_dropout.sequence[-1]).startswith('Linear'))
MLP, Sequential [0] Linear(self.in_features=3072, self.out_features=50) [1] ReLU [2] Dropout(p=0.6) [3] Linear(self.in_features=50, self.out_features=50) [4] ReLU [5] Dropout(p=0.6) [6] Linear(self.in_features=50, self.out_features=50) [7] ReLU [8] Dropout(p=0.6) [9] Linear(self.in_features=50, self.out_features=10)
# Test end-to-end gradient in train and test modes.
print('Dropout, train mode')
mlp_dropout.train(True)
for diff in compare_layer_to_torch(mlp_dropout, torch.randn(500, in_features)):
test.assertLess(diff, 1e-3)
print('Dropout, test mode')
mlp_dropout.train(False)
for diff in compare_layer_to_torch(mlp_dropout, torch.randn(500, in_features)):
test.assertLess(diff, 1e-3)
Dropout, train mode
Comparing gradients... input diff=0.000 param#01 diff=0.000 param#02 diff=0.000 param#03 diff=0.000 param#04 diff=0.000 param#05 diff=0.000 param#06 diff=0.000 param#07 diff=0.000 param#08 diff=0.000 Dropout, test mode Comparing gradients... input diff=0.000 param#01 diff=0.000 param#02 diff=0.000 param#03 diff=0.000 param#04 diff=0.000 param#05 diff=0.000 param#06 diff=0.000 param#07 diff=0.000 param#08 diff=0.000
To see whether dropout really improves generalization, let's take a small training set (small enough to overfit) and a large test set and check whether we get less overfitting and perhaps improved test-set accuracy when using dropout.
# Define a small set from CIFAR-10, but take a larger test set since we want to test generalization
batch_size = 10
max_batches = 40
in_features = 3*32*32
num_classes = 10
dl_train = torch.utils.data.DataLoader(ds_train, batch_size, shuffle=False)
dl_test = torch.utils.data.DataLoader(ds_test, batch_size*2, shuffle=False)
TODO:
Tweak the hyperparameters for this section in the part2_dropout_hp() function in the hw2/answers.py module. Try to set them so that the first model (with dropout=0) overfits. You can disable the other dropout options until you tune the hyperparameters. We can then see the effect of dropout for generalization.
# Get hyperparameters
hp = answers.part2_dropout_hp()
hidden_features = [400] * 1
num_epochs = 30
torch.manual_seed(seed)
fig=None
#for dropout in [0]: # Use this for tuning the hyperparms until you overfit
for dropout in [0, 0.4, 0.8]:
model = layers.MLP(in_features, num_classes, hidden_features, wstd=hp['wstd'], dropout=dropout)
loss_fn = layers.CrossEntropyLoss()
optimizer = optimizers.MomentumSGD(model.params(), learn_rate=hp['lr'], reg=0)
print('*** Training with dropout=', dropout)
trainer = training.LayerTrainer(model, loss_fn, optimizer)
fit_res_dropout = trainer.fit(dl_train, dl_test, num_epochs, max_batches=max_batches, print_every=6)
fig, axes = plot_fit(fit_res_dropout, fig=fig, legend=f'dropout={dropout}', log_loss=True)
*** Training with dropout= 0 --- EPOCH 1/30 ---
--- EPOCH 7/30 ---
--- EPOCH 13/30 ---
--- EPOCH 19/30 ---
--- EPOCH 25/30 ---
--- EPOCH 30/30 ---
*** Training with dropout= 0.4 --- EPOCH 1/30 ---
--- EPOCH 7/30 ---
--- EPOCH 13/30 ---
--- EPOCH 19/30 ---
--- EPOCH 25/30 ---
--- EPOCH 30/30 ---
*** Training with dropout= 0.8 --- EPOCH 1/30 ---
--- EPOCH 7/30 ---
--- EPOCH 13/30 ---
--- EPOCH 19/30 ---
--- EPOCH 25/30 ---
--- EPOCH 30/30 ---
Questions¶
TODO Answer the following questions. Write your answers in the appropriate variables in the module hw2/answers.py.
from cs236781.answers import display_answer
import hw2.answers
Question 1¶
Regarding the graphs you got for the three dropout configurations:
Explain the graphs of no-dropout vs dropout. Do they match what you expected to see?
- If yes, explain why and provide examples based on the graphs.
- If no, explain what you think the problem is and what should be modified to fix it.
Compare the low-dropout setting to the high-dropout setting and explain based on your graphs.
display_answer(hw2.answers.part2_q1)
Your answer:
1. Without dropout, the model quickly achieves high training accuracy and low training loss because it can utilize all neurons during training, allowing it to closely fit the training data, which matches our expectations. However, during testing, the loss is significantly higher and accuracy is lower, as expected, because the model overfits to the training data by relying heavily on all neurons, resulting in poor generalization to new data.
With dropout, the training loss is higher and the training accuracy is lower compared to without dropout. However, the test loss is significantly lower and the test accuracy improves slightly, as expected. This is because dropout disables a fraction of neurons during training, making it harder for the model to rely on specific neurons. This helps address overfitting by forcing the model to generalize better and not depend too heavily on particular neurons and weights.
2. The graphs show that using a dropout rate of 0.4 led to better results in terms of both loss and test accuracy during training and testing compared to using a higher dropout rate.
During training with a high dropout rate, the model underfits the data, which is indicated by relatively high loss and low accuracy. This underfitting continues during testing, where the loss remains high and accuracy is low, suggesting the model struggles to fit the training data properly and doesn't generalize well, even though it performs slightly better during testing than with no dropout at all.
The reason is that a too-high dropout rate can be problematic because the model relies on too few neurons at each step, making effective learning difficult. This decreases the model's capacity and prevents it from learning complex patterns, leading to the poor performance observed in the graphs.
Therefore, we conclude that dropout is a useful technique to improve the generalization of deep models. However, it's important to balance the dropout rate to ensure the model can still learn effectively.
Question 2¶
When training a model with the cross-entropy loss function, is it possible for the test loss to increase for a few epochs while the test accuracy also increases?
If it's possible explain how, if it's not explain why not.
display_answer(hw2.answers.part2_q2)
Your answer: Yes, it is possible for the test loss to increase for a few epochs while the test accuracy also increases. This can occur because test accuracy only measures the proportion of correct predictions, while cross-entropy loss accounts for the confidence of those predictions. If the model starts making correct predictions with lower confidence or incorrect predictions with higher confidence, the accuracy can improve even as the loss increases.
We can observe this phenomenon in our graph. For example, with dropout=0.4, it occurs between iterations 6-7, and with dropout=0, it occurs at iteration 15. Although these specific instances may vary in future runs, this example demonstrates that it can indeed happen.
Example: Consider a binary classification problem where the true labels are 0 and 1, and predictions are thresholded at 0.5.
Epoch 1:
Predictions for 4 samples: $\hat{y} = [0.55, 0.95, 0.95, 0.05]$, true labels: $[0, 1, 1, 0]$
Cross-entropy loss: $L_{CE} = -\frac{1}{4}\left(\log(0.45)+\log(0.95)+\log(0.95)+\log(0.95)\right) \approx 0.238$
Accuracy: $3/4$ (the first sample is misclassified, but the three correct predictions are very confident)
Epoch 2:
Predictions for 4 samples: $\hat{y} = [0.45, 0.9, 0.55, 0.4]$, true labels: $[0, 1, 1, 0]$
Cross-entropy loss: $L_{CE} = -\frac{1}{4}\left(\log(0.55)+\log(0.9)+\log(0.55)+\log(0.6)\right) \approx 0.453$
Accuracy: $4/4$ (all predictions are now correct, but with lower confidence)
The loss increased from $\approx 0.238$ to $\approx 0.453$ while the accuracy increased from $3/4$ to $4/4$, demonstrating how the test loss can increase while the test accuracy also increases.
Question 3¶
Explain the difference between gradient descent and back-propagation.
Compare in detail between gradient descent (GD) and stochastic gradient descent (SGD).
Why is SGD used more often in the practice of deep learning? Provide a few justifications.
You would like to try GD to train your model instead of SGD, but you're concerned that your dataset won't fit in memory. A friend suggested that you should split the data into disjoint batches, do multiple forward passes until all data is exhausted, and then do one backward pass on the sum of the losses.
- Would this approach produce a gradient equivalent to GD? Why or why not? Provide mathematical justification for your answer.
- You implemented the suggested approach, and were careful to use batch sizes small enough so that each batch fits in memory. However, after some number of batches you got an out of memory error. What happened?
display_answer(hw2.answers.part2_q3)
Your answer:
Gradient Descent: An optimization technique that aims to minimize a loss function by iteratively adjusting the model's parameters. It works by moving in the direction of the steepest descent to find a local minimum.
Back-Propagation: A method for efficiently computing the gradients of the loss function with respect to the model's parameters. The algorithm is based on the chain rule from calculus. Back-propagation is typically used in optimization algorithms that require gradients.
In short: Gradient Descent is the optimization process — it iteratively updates the parameters in the direction of the negative gradient of the loss, computed over the training data. Back-propagation is the method for computing the gradients that this optimization requires: it applies the chain rule to propagate the error from the output layer back toward the input layer, yielding the gradient of the loss with respect to each layer's parameters.
Gradient Descent (GD) and Stochastic Gradient Descent (SGD) are both optimization techniques that minimize a loss function by updating the model parameters iteratively. In GD, each update uses the gradient of the loss computed over the entire training set. In SGD, each update uses the gradient computed from a single randomly selected training example (or a small mini-batch). The expectation of the noisy SGD gradient equals the true GD gradient, so over many iterations SGD approximates the true descent direction. Because each SGD update touches only a small subset of the data, updates are faster and more memory-efficient, but the convergence path is noisier and less stable than GD's. GD has stable, smooth convergence but is computationally expensive for large datasets; SGD is therefore preferred for large datasets, since it allows faster updates and can handle data that does not fit into memory by processing it in small batches.
Stochastic Gradient Descent (SGD) is widely used in deep learning for several reasons:
- It is computationally efficient, especially for large datasets, because it does not require loading the entire dataset into memory, and it provides faster convergence compared to full-batch Gradient Descent, as updates are made more frequently. (Gradient Descent has significant memory and computation demands since it requires processing the entire dataset for a single optimization step, which makes it impractical for large datasets.)
- In SGD, the error surface is dynamic, changing with each batch of training samples. This variability can enhance optimization by helping the optimizer escape flat regions or sharp local minima, as these problematic features may be smoothed out in the loss surface of subsequent batches.
- SGD introduces noise due to its random sampling, which acts as a form of regularization. This noise can prevent the optimizer from converging to a minimum that perfectly matches the training data, thereby reducing the risk of overfitting and improving generalization to unseen data.
A.
The two approaches yield the same total loss after a full pass over the dataset. Mathematical justification:
Let the dataset be $\mathcal{S}$ with size $N$.
Split $\mathcal{S}$ into $M$ disjoint batches $\{\mathcal{B}_1, \mathcal{B}_2, \ldots, \mathcal{B}_M\}$ where each batch $\mathcal{B}_j$ contains $N_j$ samples such that $\sum_{j=1}^{M} N_j = N$.
The total loss over the entire dataset is:
$$ L(\theta) = \sum_{i=1}^{N} \ell(f_\theta(x_i), y_i) $$
When split into batches, the total loss can be written as the sum of losses for each batch:
$$ L_B(\theta) = \sum_{j=1}^{M} \sum_{i \in \mathcal{B}_j} \ell(f_\theta(x_i), y_i) $$
Since the batches are disjoint sets whose union is the entire $\mathcal{S}$, we can deduce that $L_B(\theta) = \sum_{j=1}^{M} \sum_{i \in \mathcal{B}_j} \ell(f_\theta(x_i), y_i) = \sum_{i=1}^{N} \ell(f_\theta(x_i), y_i) = L(\theta)$.
Since the losses are equal, by linearity of the gradient the resulting parameter update is theoretically identical to GD's. The practical problem is that computing this gradient with the chain rule requires keeping the intermediate activations of every forward pass, as we explain in the next section.
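The linearity argument above can be verified numerically on a simple model. Below is a sketch with a linear least-squares loss (the data, model, and helper names are made up for illustration): the loss and gradient accumulated over disjoint batches match the full-dataset loss and gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(12, 3)), rng.normal(size=12)  # toy dataset, N=12
w = rng.normal(size=3)                                # toy parameter vector

def loss_and_grad(Xb, yb, w):
    """Sum-of-squared-errors loss over a batch, and its gradient w.r.t. w."""
    r = Xb @ w - yb
    return float(r @ r), 2 * Xb.T @ r

# Full-dataset loss/gradient vs. accumulation over 3 disjoint batches.
L_full, g_full = loss_and_grad(X, y, w)
L_acc, g_acc = 0.0, np.zeros_like(w)
for Xb, yb in zip(np.split(X, 3), np.split(y, 3)):
    Lb, gb = loss_and_grad(Xb, yb, w)
    L_acc += Lb
    g_acc += gb

print(np.isclose(L_full, L_acc), np.allclose(g_full, g_acc))  # True True
```

Note that the equality holds whether we sum losses first and differentiate once, or differentiate per batch and sum gradients — which is why per-batch backward passes (gradient accumulation) are a memory-safe alternative.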
B.
Even though each batch is small enough to fit into memory, the out-of-memory error likely occurred because intermediate activations accumulate. When performing multiple forward passes before a single backward pass, every intermediate activation must be kept in memory until the backward pass runs. Accumulating these activations over many batches without releasing them makes memory usage grow with the number of batches, eventually exhausting memory. To avoid this, perform a backward pass (and either a parameter update or gradient accumulation) for each batch individually rather than summing the losses of all batches first. This frees each batch's activations as soon as its backward pass completes, preventing excessive memory usage.
Question 4 (Automatic Differentiation)¶
Let $f = f_n \circ f_{n-1} \circ ... \circ f_1$ where each $f_i: \mathbb{R} \rightarrow \mathbb{R}$ is a differentiable function which is easy to evaluate and differentiate (each query costs $\mathcal{O}(1)$ at a given point).
- In this exercise you will reduce the memory complexity for evaluating $\nabla f (x_0)$ at some point $x_0$.
Assume that you are given $f$ already expressed as a computational graph, and a point $x_0$.
- Show how to reduce the memory complexity for computing the gradient using forward mode AD (maintaining the $\mathcal{O}(n)$ computation cost). What is the memory complexity?
- Show how to reduce the memory complexity for computing the gradient using backward mode AD (maintaining the $\mathcal{O}(n)$ computation cost). What is the memory complexity?
- Can these techniques be generalized for arbitrary computational graphs?
- Think how the backprop algorithm can benefit from these techniques when applied to deep architectures (e.g. VGGs, ResNets).
display_answer(hw2.answers.part2_q4)
Your answer:
4.1. Given a computational graph with an edge from $f_{j-1}$ to $f_j$ whenever the output of $f_{j-1}$ is the input of $f_j$, we can compute both the value $f(x_0)$ and the derivative $f'(x_0)$ without storing intermediate values.
The algorithm:
Initialization:
- Set $x \leftarrow x_0$
- Set $\text{gradient} \leftarrow 1$
Forward Pass:
- For $j = 1$ to $n$:
  - $\text{value} \leftarrow f_j(x)$
  - $\text{gradient} \leftarrow \text{gradient} \cdot f_j'(x)$
  - $x \leftarrow \text{value}$
Result:
- The final $\text{gradient}$ is $\nabla f(x_0)$
- Memory Usage: The algorithm uses only two variables $x$ and $\text{gradient}$ throughout the computation, which requires $O(1)$ memory. The computational complexity remains linear, $O(n)$.
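A minimal sketch of this forward-mode algorithm in Python (the function name is ours; the chain is supplied as lists of functions and their derivatives): only two scalars are carried through the loop, so memory is $O(1)$ and computation is $O(n)$.

```python
import math

def value_and_grad_forward(fs, dfs, x0):
    """Forward-mode AD over the chain f_n ∘ ... ∘ f_1 at x0.
    Carries only the current value x and the running derivative product."""
    x, grad = x0, 1.0
    for f, df in zip(fs, dfs):
        grad *= df(x)  # chain rule: multiply by f_j'(x) at the CURRENT input
        x = f(x)       # then advance the value
    return x, grad

# Example chain: f = exp ∘ sin, so f'(x) = cos(x) * exp(sin(x)).
val, grad = value_and_grad_forward([math.sin, math.exp], [math.cos, math.exp], 0.5)
print(val, grad)
```

The derivative must be evaluated at the input of $f_j$ before the value is overwritten — that ordering is the whole trick that avoids storing the intermediate values.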
4.2
Given the same chain-structured computational graph, backward (reverse) mode AD computes $f'(x_0)$ without storing a 'grad' property per node; however, it still requires the intermediate values from the forward pass. Assuming each node $v_j$ stores its value from the forward pass, the algorithm is:
The algorithm:
Initialization:
- Set $\text{gradient} \leftarrow 1$
Backward Pass:
- For $j = n-1$ down to $0$:
  - $\text{gradient} \leftarrow \text{gradient} \cdot v_{j+1}.fn.\text{derivative}(v_j.\text{val})$
Result:
- The final $\text{gradient}$ is $\nabla f(x_0)$
The algorithm requires storing the intermediate values $v_j$ from the forward pass, so the memory complexity is $O(n)$. However, we save a factor of 2 in memory compared to standard backprop, since we do not keep a 'grad' property per vertex. The computational complexity remains linear, $O(n)$.
A way to reduce the memory by a factor of $\sqrt{n}$ is checkpointing. Checkpointing strategically reduces the requirement by storing only key intermediate values and recomputing the missing ones during the backward pass as needed.
Checkpoints Algorithm:
Initialization:
- Determine checkpoints at strategic intervals (e.g., every $\sqrt{n}$ steps).
- Set $\text{gradient} \leftarrow 1$.
Forward Pass with Checkpoints:
- For each node $j$ from $0$ to $n-1$:
  - Compute $v_j$.
  - If $j$ is a checkpoint, store $v_j$.
Backward Pass Using Checkpoints:
- For each node $j$ from $n-1$ down to $0$:
  - If $v_j$ is not stored (not a checkpoint), recompute $v_j$ starting from the nearest previous checkpoint.
  - Compute $\text{gradient} \leftarrow \text{gradient} \cdot v_{j+1}.fn.\text{derivative}(v_j.\text{val})$.
Memory and Computational Complexity:
- Memory Complexity: The memory complexity is reduced to $O(\sqrt{n})$ if checkpoints are set at every $\sqrt{n}$ steps.
- Computational Cost: The total computational cost remains $O(n)$. Each segment between checkpoints might require recomputation, but the total number of operations does not exceed $n$ significantly due to efficient checkpoint spacing.
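A minimal sketch of checkpointed reverse-mode AD for a chain of scalar functions (the helper name and the tanh chain are illustrative). Only every $k$-th intermediate value is stored; the rest are recomputed from the nearest checkpoint during the backward pass.

```python
import math

def grad_checkpointed(fs, dfs, x0, k):
    """Reverse-mode gradient of fs[n-1] ∘ ... ∘ fs[0] at x0, storing a
    checkpoint every k steps and recomputing segments in the backward pass.
    Memory: O(n/k) checkpoints (≈ O(sqrt(n)) for k = sqrt(n))."""
    n = len(fs)
    ckpt, x = {}, x0
    for j in range(n):              # forward pass: keep only checkpointed inputs
        if j % k == 0:
            ckpt[j] = x             # ckpt[j] holds the input to fs[j]
        x = fs[j](x)
    grad = 1.0
    for j in range(n - 1, -1, -1):  # backward pass
        v = ckpt[(j // k) * k]
        for t in range((j // k) * k, j):
            v = fs[t](v)            # recompute the input of fs[j] from the checkpoint
        grad *= dfs[j](v)
    return grad

n = 16
fs  = [math.tanh] * n
dfs = [lambda x: 1.0 - math.tanh(x) ** 2] * n
g = grad_checkpointed(fs, dfs, 0.5, k=4)

# Sanity check against the O(1)-memory forward-mode product of the same factors.
x, g_fwd = 0.5, 1.0
for f, df in zip(fs, dfs):
    g_fwd, x = g_fwd * df(x), f(x)
print(abs(g - g_fwd) < 1e-9)  # True
```

Each segment of length $k$ is recomputed at most once, so the total work stays $O(n)$ (roughly one extra forward pass) while the stored state shrinks to $O(n/k)$.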
4.3 In general computational graphs, achieving $O(1)$ memory usage as in the ideal forward-mode AD scenario is not feasible. This limitation arises because complex graphs often have multiple paths leading to the final node $f_n$, each requiring the storage of intermediate values. Theoretically, if we could pre-determine all paths from $v_0$ to $v_n$, we could traverse each path separately and sequentially to minimize memory usage. However, this method is impractical due to the exponentially many potential paths in a general graph, and it also precludes the use of parallel processing techniques.
Consequently, in practical settings, the memory usage for forward mode AD tends to be proportional to the amount of memory needed to store intermediate values necessary for evaluating the node $f_n$. Nonetheless, we can adopt checkpointing strategies—commonly used in both forward and backward mode AD—to store computational values only at strategic points. This approach helps in managing memory more effectively.
Additional strategies to further reduce memory usage include:
- Memory Release: Actively manage memory by releasing intermediate values that are no longer needed during computations.
- Mixed Mode AD: Utilize forward mode AD for sections of the graph with fewer paths from input to output, and apply backward mode AD for more complex sections of the graph. This hybrid approach leverages the strengths of both AD modes based on the specific structure of the network.
4.4 Large neural networks, characterized by their vast number of parameters and extensive inputs, typically require substantial memory to compute gradients during backpropagation. As networks deepen, the need to store intermediate values escalates, further increasing memory demands. Employing the aforementioned strategies, such as checkpointing and selective memory release, can significantly reduce the memory footprint required for training. These techniques not only make it feasible to train deeper and more complex models but also enhance the efficiency of the learning process by optimizing memory usage.
$$ \newcommand{\mat}[1]{\boldsymbol {#1}} \newcommand{\mattr}[1]{\boldsymbol {#1}^\top} \newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}} \newcommand{\vec}[1]{\boldsymbol {#1}} \newcommand{\vectr}[1]{\boldsymbol {#1}^\top} \newcommand{\rvar}[1]{\mathrm {#1}} \newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}} \newcommand{\diag}{\mathop{\mathrm {diag}}} \newcommand{\set}[1]{\mathbb {#1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\bb}[1]{\boldsymbol{#1}} $$
Part 3: Binary Classification with Multilayer Perceptrons¶
In this part we'll only answer questions. We hope this HW reduction helps you finish on time :)
from cs236781.answers import display_answer
import hw2.answers
Question 1¶
Please explain the following errors and how to reduce them using the tools we learned in class. Use terms such as "receptive field", "population loss", etc.
- High Optimization error?
- High Generalization error?
- High Approximation error?
display_answer(hw2.answers.part3_q1)
Your answer:
High Optimization error:
High optimization error occurs when the model is unable to minimize the training loss effectively. This indicates that the model is not fitting the training data well, which can be due to poor optimization techniques, an insufficiently complex model, or issues inherent to the training process. In addition, neural networks often have highly non-convex loss surfaces, making it challenging to find the global optimum. Issues such as vanishing gradients, varying rates of convergence in different dimensions, and dependency on initialization can also contribute to high optimization error.
Solutions proposed during course:
- Learning Rate Adjustment:
- Use an appropriate learning rate: Too high a learning rate can cause the model to overshoot minima, while too low can lead to slow convergence.
- Learning Rate Scheduling: Reduce the learning rate every few epochs to improve convergence.
- Warmups: Gradually increase the learning rate at the start of training to stabilize initial updates.
- Advanced Optimization Algorithms:
- Adam, RMSprop, or AdaGrad: These optimizers adapt the learning rate and handle sparse gradients more effectively than standard gradient descent.
- SGD with Momentum: Enhances convergence rates and stability by accelerating the gradient vectors in the right directions.
- Receptive Field:
- Increase the receptive field: with larger receptive fields, neurons can extract more relevant and comprehensive features from the input data, which means the model can learn more effectively from the training data, leading to a better fit and reduced optimization error.
- Regularization Techniques:
- L2 Regularization (Weight Decay): Helps to prevent overfitting and improve the model's ability to optimize effectively.
- Dropout: Another regularization technique to prevent overfitting by randomly dropping units during training.
- Hyperparameter Tuning:
- Experiment with different hyperparameters such as batch size, learning rate schedules, and network architecture to find the optimal configuration.
- Proper Weight Initialization:
- Xavier Initialization: Use advanced initialization methods like Xavier or He initialization to set effective starting points for the weights, which helps in achieving faster and more stable convergence.
- Normalization:
- Batch Normalization: Stabilize and accelerate training by normalizing the inputs of each layer, which helps in maintaining consistent gradient behavior across different axes.
- Preprocessing Data: Normalize and preprocess data to ensure consistent gradient behavior across different axes. This can include standardization, normalization, and using techniques like batch normalization.
- Skip Connections:
- Introduce skip connections (e.g., as in ResNet) to mitigate vanishing gradients and ensure better gradient flow during backpropagation.
High Generalization Error:
High Generalization error arises when the model performs well on the training data but poorly on unseen data. This indicates overfitting to the training data. This happens because we train on empirical loss instead of population loss.
Solutions proposed during course:
- More Data: Gather more data or data that better represents the underlying distribution D
- Cross-Validation: Use cross-validation to find hyperparameters that will not cause overfitting/underfitting.
- Regularization: Apply L1 or L2 regularization to penalize large weights.
- Early Stopping: Stop training when the validation loss starts to increase to prevent overfitting
- Mini-Batches: Use mini-batches during training to introduce some noise and prevent overfitting.
- Data Augmentation: Increase the diversity of the training data through techniques like rotation, flipping, and scaling.
- Adding Noise: Introduce noise to inputs or weights during training to regularize the model.
- Label Smoothing: Adjust labels slightly to make the model less confident and more generalizable.
- Dropout: Randomly deactivate neurons during training to prevent overfitting.
- Ensembles: Combine multiple models to improve robustness and generalization.
High Approximation Error:
Approximation error occurs when the chosen hypothesis class H is not expressive enough (the model is too simple) to capture the underlying patterns in the data. This indicates underfitting, where the model lacks the capacity to learn the complexity of the data.
Solutions proposed during course:
- More Expressive Hypothesis Class: Use a more powerful hypothesis class, such as deep neural networks (DNNs)
- Increase Parameters: Add more layers and neurons to the network to enhance its capacity.
- Tailor the model to the specific domain, such as using convolutional neural networks (CNNs).
- Receptive Field Adjustments: In CNNs, increasing the receptive field as we go deeper in the network allows each layer to capture features at different levels of abstraction, reducing approximation error by enabling the model to learn more complex patterns.
- Feature Engineering: Create more informative features that capture the underlying patterns in the data. This can involve domain knowledge and techniques like polynomial features or interaction terms
Question 2¶
Consider a binary classifier.
Describe a case where you expect the false positive rate (FPR) to be higher, and one where the false negative rate (FNR) is higher.
display_answer(hw2.answers.part3_q2)
Your answer:
False Positive Rate (FPR): The proportion of negative instances that are incorrectly classified as positive. $$ \text{FPR} = \frac{\text{False Positive (FP)}}{\text{False Positive (FP)} + \text{True Negative(TN)}} $$ False Negative Rate (FNR): The proportion of positive instances that are incorrectly classified as negative. $$ \text{FNR} = \frac{\text{False Negative (FN)}}{\text{False Negative (FN)} + \text{True Positive(TP)}} $$
We expect a higher FPR when the ratio of positive to negative labels in the training dataset does not approximate the real-life ratio. For example, in email spam detection, using more spam emails than usual can make the classifier overly sensitive, leading to more legitimate emails being marked as spam (higher FPR). Additionally, setting a low decision threshold to avoid missing any spam (high cost of false negatives) further increases the FPR.
We expect a higher FNR when the ratio of positive to negative labels does not reflect real-life proportions, such as in disease detection with rare diseases. The scarcity of positive samples prevents the classifier from learning to detect them effectively, resulting in a higher FNR. Additionally, setting a high decision threshold to avoid false positives (high cost of false positives) can lead to missing actual positive cases, further increasing the FNR.
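The two rates defined above can be sketched in code (the scores and thresholds below are purely illustrative): lowering the decision threshold trades FNR for FPR, and raising it does the opposite.

```python
import numpy as np

def fpr_fnr(y_true, y_score, threshold):
    """FPR and FNR of a binary classifier thresholded at `threshold`."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tp = np.sum((y_pred == 1) & (y_true == 1))
    return fp / (fp + tn), fn / (fn + tp)

y_true  = [0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.1, 0.3, 0.4, 0.6, 0.35, 0.55, 0.7, 0.9]

fpr_a, fnr_a = fpr_fnr(y_true, y_score, threshold=0.5)  # 0.25, 0.25
fpr_b, fnr_b = fpr_fnr(y_true, y_score, threshold=0.3)  # 0.75, 0.0
print(fpr_a, fnr_a)
print(fpr_b, fnr_b)
```

The threshold sweep is exactly what the ROC curve traces: each threshold gives one (FPR, TPR) point, with FNR = 1 − TPR.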
Question 3¶
You're training a binary classifier to screen a large cohort of patients for some disease, with the aim of detecting the disease early, before any symptoms appear. You train the model on easy-to-obtain features, so screening each individual patient is simple and low-cost. In case the model classifies a patient as sick, she must then be sent for further testing in order to confirm the illness. Assume that these further tests are expensive and involve high risk to the patient. Assume also that once diagnosed, a low-cost treatment exists.
You wish to screen as many people as possible at the lowest possible cost and loss of life. Would you still choose the same "optimal" point on the ROC curve as above? If not, how would you choose it? Answer these questions for two possible scenarios:
- A person with the disease will develop non-lethal symptoms that immediately confirm the diagnosis and can then be treated.
- A person with the disease shows no clear symptoms and may die with high probability if not diagnosed early enough, either by your model or by the expensive test.
Explain your answers.
display_answer(hw2.answers.part3_q3)
Your answer: The choice of the "optimal" point on the ROC curve depends on the consequences of false positives (FP) and false negatives (FN) in each scenario.
In the first scenario, where a person with the disease will develop non-lethal symptoms that confirm the diagnosis and can then be treated, it is crucial to minimize the False Positive Rate (FPR) to avoid unnecessary high-cost, high-risk tests, even at the price of a moderate increase in the False Negative Rate (FNR): patients the model misses will still be diagnosed and treated once symptoms appear.
In the second scenario, where a person with the disease shows no clear symptoms and may die with high probability if not diagnosed early enough, the priority shifts to maximizing the True Positive Rate (TPR) to ensure early detection and treatment, thus reducing the risk of death, even if this results in a higher FPR.
Therefore, the optimal point on the ROC curve differs: it should favor a lower FPR in the first scenario and a higher TPR in the second, based on the differing costs and risks associated with FPs and FNs in each case.
Question 4¶
Explain why an MLP would not be best to train over sequential data. Consider text where each data point is one word and you want to classify the sentiment of a sentence.
display_answer(hw2.answers.part3_q4)
Your answer: An MLP is not ideal for training on sequential data such as text, where each data point is a word, and the goal is to classify the sentiment of a sentence. The main reason is that MLPs process inputs independently without considering the order in which they appear. An MLP gets one input at a time and lacks memory, making it problematic to handle sequential data.
One potential solution is to average all the word embeddings together, but this approach loses the positional information of the words. Consequently, sentences containing the same words in different orders would be indistinguishable, leading to confusion in the model. For example, the sentences "I am not happy" and "I am happy not" would appear the same to the MLP if their word embeddings are averaged, despite having different sentiments.
Another approach could be to concatenate the word embeddings to create a single input vector for the MLP. However, this has significant disadvantages. The resulting vectors can become excessively long, which increases the computational complexity and the risk of overfitting. Additionally, this method imposes a limit on the size of the sentences that can be processed, as the input size for the MLP needs to be fixed, necessitating either truncation of longer sentences or padding of shorter ones, both of which can lead to loss of information or inclusion of irrelevant data.
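The order-loss problem with embedding averaging can be demonstrated directly. The embeddings below are random and purely illustrative; the point is that the averaged MLP input is identical for two sentences with opposite word orders.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 4-dimensional word embeddings for a toy vocabulary.
emb = {w: rng.normal(size=4) for w in ["i", "am", "not", "happy"]}

def mlp_input_by_averaging(sentence):
    """Fixed-size MLP input: the mean of the sentence's word embeddings."""
    return np.mean([emb[w] for w in sentence.split()], axis=0)

a = mlp_input_by_averaging("i am not happy")
b = mlp_input_by_averaging("i am happy not")
print(np.allclose(a, b))  # True: word order is lost; the MLP sees identical inputs
```

Since the mean is invariant to permutation of its inputs, any two sentences with the same bag of words collapse to the same input vector, regardless of sentiment.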
$$ \newcommand{\mat}[1]{\boldsymbol {#1}} \newcommand{\mattr}[1]{\boldsymbol {#1}^\top} \newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}} \newcommand{\vec}[1]{\boldsymbol {#1}} \newcommand{\vectr}[1]{\boldsymbol {#1}^\top} \newcommand{\rvar}[1]{\mathrm {#1}} \newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}} \newcommand{\diag}{\mathop{\mathrm {diag}}} \newcommand{\set}[1]{\mathbb {#1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\bb}[1]{\boldsymbol{#1}} $$
Part 4: Convolutional Neural Networks¶
In this part we will explore convolutional networks. We'll implement a common block-based deep CNN pattern, with and without residual connections.
import os
import re
import sys
import glob
import numpy as np
import matplotlib.pyplot as plt
import unittest
import torch
import torchvision
import torchvision.transforms as tvtf
%matplotlib inline
%load_ext autoreload
%autoreload 2
seed = 42
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
plt.rcParams.update({'font.size': 12})
test = unittest.TestCase()
Reminder: Convolutional layers and networks¶
Convolutional layers are the most essential building blocks of state-of-the-art deep learning image classification models, and also play an important role in many other tasks. As we saw in the tutorial, when applied to images, convolutional layers operate on and produce volumes (3D tensors) of activations.
A convenient way to interpret convolutional layers for images is as a collection of 3D learnable filters, each of which operates on a small spatial region of the input volume. Each filter is convolved with the input volume ("slides over it"), and a dot product is computed at each location followed by a non-linearity which produces one activation. All these activations produce a 2D plane known as a feature map. Multiple feature maps (one for each filter) comprise the output volume.

A crucial property of convolutional layers is their translation equivariance, i.e. shifting the input results in an equivalently shifted output. This gives the network the ability to detect features regardless of their spatial location in the input.
Convolutional network architectures usually follow a pattern of basic repeating blocks: one or more convolution layers, each followed by a non-linearity (generally ReLU), and then a pooling layer to reduce spatial dimensions. Usually, the number of convolutional filters increases the deeper they are in the network. These layers are meant to extract features from the input. Then, one or more fully-connected layers are used to combine the extracted features into the required number of output class scores.
Building convolutional networks with PyTorch¶
PyTorch provides all the basic building blocks needed for creating a convolutional architecture within the torch.nn package.
Let's use them to create a basic convolutional network with the following architecture pattern:
[(CONV -> ACT)*P -> POOL]*(N/P) -> (FC -> ACT)*M -> FC
Here $N$ is the total number of convolutional layers, $P$ specifies how many convolutions to perform before each pooling layer and $M$ specifies the number of hidden fully-connected layers before the final output layer.
TODO: Complete the implementation of the CNN class in the hw2/cnn.py module.
Use PyTorch's nn.Conv2d and nn.MaxPool2d for the convolution and pooling layers.
It's recommended to implement the missing functionality in the order of the class' methods.
from hw2.cnn import CNN
test_params = [
dict(
in_size=(3,100,100), out_classes=10,
channels=[32]*4, pool_every=2, hidden_dims=[100]*2,
conv_params=dict(kernel_size=3, stride=1, padding=1),
activation_type='relu', activation_params=dict(),
pooling_type='max', pooling_params=dict(kernel_size=2),
),
dict(
in_size=(3,100,100), out_classes=10,
channels=[32]*4, pool_every=2, hidden_dims=[100]*2,
conv_params=dict(kernel_size=5, stride=2, padding=3),
activation_type='lrelu', activation_params=dict(negative_slope=0.05),
pooling_type='avg', pooling_params=dict(kernel_size=3),
),
dict(
in_size=(3,100,100), out_classes=3,
channels=[16]*5, pool_every=3, hidden_dims=[100]*1,
conv_params=dict(kernel_size=2, stride=2, padding=2),
activation_type='lrelu', activation_params=dict(negative_slope=0.1),
pooling_type='max', pooling_params=dict(kernel_size=2),
),
]
for i, params in enumerate(test_params):
torch.manual_seed(seed)
net = CNN(**params)
print(f"\n=== test {i=} ===")
print(net)
torch.manual_seed(seed)
test_out = net(torch.ones(1, 3, 100, 100))
print(f'{test_out=}')
expected_out = torch.load(f'tests/assets/expected_conv_out_{i:02d}.pt')
print(f'max_diff={torch.max(torch.abs(test_out - expected_out)).item()}')
test.assertTrue(torch.allclose(test_out, expected_out, atol=1e-3))
=== test i=0 ===
CNN(
(feature_extractor): Sequential(
(0): Conv2d(3, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): ReLU()
(2): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(3): ReLU()
(4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(5): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(6): ReLU()
(7): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(8): ReLU()
(9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
)
(mlp): MLP(
(layers): Sequential(
(0): Linear(in_features=20000, out_features=100, bias=True)
(1): ReLU()
(2): Linear(in_features=100, out_features=100, bias=True)
(3): ReLU()
(4): Linear(in_features=100, out_features=10, bias=True)
(5): Identity()
)
)
)
test_out=tensor([[ 0.0745, -0.1058, 0.0928, 0.0476, 0.0057, 0.0051, 0.0938, -0.0582,
0.0573, 0.0583]], grad_fn=<AddmmBackward0>)
max_diff=0.0
=== test i=1 ===
CNN(
(feature_extractor): Sequential(
(0): Conv2d(3, 32, kernel_size=(5, 5), stride=(2, 2), padding=(3, 3))
(1): LeakyReLU(negative_slope=0.05)
(2): Conv2d(32, 32, kernel_size=(5, 5), stride=(2, 2), padding=(3, 3))
(3): LeakyReLU(negative_slope=0.05)
(4): AvgPool2d(kernel_size=3, stride=3, padding=0)
(5): Conv2d(32, 32, kernel_size=(5, 5), stride=(2, 2), padding=(3, 3))
(6): LeakyReLU(negative_slope=0.05)
(7): Conv2d(32, 32, kernel_size=(5, 5), stride=(2, 2), padding=(3, 3))
(8): LeakyReLU(negative_slope=0.05)
(9): AvgPool2d(kernel_size=3, stride=3, padding=0)
)
(mlp): MLP(
(layers): Sequential(
(0): Linear(in_features=32, out_features=100, bias=True)
(1): LeakyReLU(negative_slope=0.05)
(2): Linear(in_features=100, out_features=100, bias=True)
(3): LeakyReLU(negative_slope=0.05)
(4): Linear(in_features=100, out_features=10, bias=True)
(5): Identity()
)
)
)
test_out=tensor([[ 0.0724, -0.0030, 0.0637, -0.0073, 0.0932, -0.0662, -0.0656, 0.0076,
0.0193, 0.0241]], grad_fn=<AddmmBackward0>)
max_diff=0.0
=== test i=2 ===
CNN(
(feature_extractor): Sequential(
(0): Conv2d(3, 16, kernel_size=(2, 2), stride=(2, 2), padding=(2, 2))
(1): LeakyReLU(negative_slope=0.1)
(2): Conv2d(16, 16, kernel_size=(2, 2), stride=(2, 2), padding=(2, 2))
(3): LeakyReLU(negative_slope=0.1)
(4): Conv2d(16, 16, kernel_size=(2, 2), stride=(2, 2), padding=(2, 2))
(5): LeakyReLU(negative_slope=0.1)
(6): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(7): Conv2d(16, 16, kernel_size=(2, 2), stride=(2, 2), padding=(2, 2))
(8): LeakyReLU(negative_slope=0.1)
(9): Conv2d(16, 16, kernel_size=(2, 2), stride=(2, 2), padding=(2, 2))
(10): LeakyReLU(negative_slope=0.1)
)
(mlp): MLP(
(layers): Sequential(
(0): Linear(in_features=400, out_features=100, bias=True)
(1): LeakyReLU(negative_slope=0.1)
(2): Linear(in_features=100, out_features=3, bias=True)
(3): Identity()
)
)
)
test_out=tensor([[-0.0004, -0.0094, 0.0817]], grad_fn=<AddmmBackward0>)
max_diff=0.0
As before, we'll wrap our model with a Classifier that provides the necessary functionality for calculating probability scores and obtaining class label predictions.
This time, we'll use a simple approach that simply selects the class with the highest score.
TODO: Implement the ArgMaxClassifier in the hw2/classifier.py module.
from hw2.classifier import ArgMaxClassifier
model = ArgMaxClassifier(model=CNN(**test_params[0]))
test_image = torch.randint(low=0, high=256, size=(3, 100, 100), dtype=torch.float).unsqueeze(0)
test.assertEqual(model.classify(test_image).shape, (1,))
test.assertEqual(model.predict_proba(test_image).shape, (1, 10))
test.assertAlmostEqual(torch.sum(model.predict_proba(test_image)).item(), 1.0, delta=1e-3)
Let's now load CIFAR-10 to use as our dataset.
data_dir = os.path.expanduser('~/.pytorch-datasets')
ds_train = torchvision.datasets.CIFAR10(root=data_dir, download=True, train=True, transform=tvtf.ToTensor())
ds_test = torchvision.datasets.CIFAR10(root=data_dir, download=True, train=False, transform=tvtf.ToTensor())
print(f'Train: {len(ds_train)} samples')
print(f'Test: {len(ds_test)} samples')
x0,_ = ds_train[0]
in_size = x0.shape
num_classes = 10
print('input image size =', in_size)
Files already downloaded and verified
Files already downloaded and verified
Train: 50000 samples Test: 10000 samples input image size = torch.Size([3, 32, 32])
As usual, let's run a sanity test: make sure we can overfit a tiny dataset with our model. But first, we need to adapt our Trainer for PyTorch models.
TODO:
- Complete the implementation of the ClassifierTrainer class in the hw2/training.py module if you haven't done so already.
- Set the optimizer hyperparameters in part4_optim_hp() in hw2/answers.py.
from hw2.training import ClassifierTrainer
from hw2.answers import part4_optim_hp
torch.manual_seed(seed)
# Define a tiny part of the CIFAR-10 dataset to overfit it
batch_size = 2
max_batches = 25
dl_train = torch.utils.data.DataLoader(ds_train, batch_size, shuffle=False)
# Create model, loss and optimizer instances
model = ArgMaxClassifier(
model=CNN(
in_size, num_classes, channels=[32], pool_every=1, hidden_dims=[100],
conv_params=dict(kernel_size=3, stride=1, padding=1),
pooling_params=dict(kernel_size=2),
)
)
hp_optim = part4_optim_hp()
loss_fn = hp_optim.pop('loss_fn')
optimizer = torch.optim.SGD(params=model.parameters(), **hp_optim)
# Use ClassifierTrainer to run only the training loop a few times.
trainer = ClassifierTrainer(model, loss_fn, optimizer, device)
best_acc = 0
for i in range(25):
res = trainer.train_epoch(dl_train, max_batches=max_batches, verbose=(i%5==0))
best_acc = res.accuracy if res.accuracy > best_acc else best_acc
# Test overfitting
test.assertGreaterEqual(best_acc, 90)
train_batch: 0%| | 0/25 [00:00<?, ?it/s]
train_batch: 0%| | 0/25 [00:00<?, ?it/s]
train_batch: 0%| | 0/25 [00:00<?, ?it/s]
train_batch: 0%| | 0/25 [00:00<?, ?it/s]
train_batch: 0%| | 0/25 [00:00<?, ?it/s]
Residual Networks¶
A very common addition to the basic convolutional architecture described above is the shortcut connection. First proposed by He et al. (2016), this simple addition has been shown to be a crucial ingredient for effective learning in very deep networks. Virtually all state-of-the-art image classification models from recent years use this technique.
The idea is to add a shortcut, or skip, around every two or more convolutional layers:

On the left we see an example of a regular Residual Block, which takes a 64-channel input and performs two 3x3 convolutions, whose output is added to the original input.
On the right we see an example of a Bottleneck Residual Block, which takes a 256-channel input, projects it to a 64-channel tensor with a 1x1 convolution, performs an inner 3x3 convolution, and then applies another 1x1 projection convolution back to the original number of channels, 256. The output is then added to the original input.
Overall, we can denote the bottleneck channel structure in this example as 256->64->64->256, where the first and last arrows denote the 1x1 convolutions and the middle arrow is the inner convolution. Note that a 1x1 convolution with the default parameters (in PyTorch) changes only the channel dimension of the tensor.
This gives the network an easy way to learn an identity mapping: drive the convolution weights toward zero so the main path contributes nothing. The outcome is that the convolutional layers learn a residual mapping, i.e. some delta that is applied to the identity map, instead of learning a completely new mapping from scratch.
Let's start by implementing a general residual block, representing a structure similar to the above diagrams. Our residual block will be composed of:
- A "main path" with some number of convolutional layers with ReLU between them. Optionally, we'll also apply dropout and batch normalization layers (in this order) between the convolutions, before the ReLU.
- A "shortcut path" implementing an identity mapping around the main path. In case of a different number of input/output channels, the shortcut path should contain an additional 1x1 convolution to project the channel dimension.
- The sum of the main and shortcut path outputs is passed through a ReLU and returned.
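The channel-projection rule for the shortcut path can be sketched as follows (illustrative names; the graded implementation belongs in hw2/cnn.py):

```python
import torch
import torch.nn as nn

# Illustrative skeleton of the shortcut-path logic (not the full graded
# ResidualBlock): project with a bias-free 1x1 conv only when the channel
# counts differ, otherwise pass the input through unchanged.
def make_shortcut(in_channels: int, out_channels: int) -> nn.Module:
    if in_channels != out_channels:
        return nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
    return nn.Identity()

x = torch.randn(1, 3, 32, 32)
print(make_shortcut(3, 4)(x).shape)  # channels projected from 3 to 4
print(make_shortcut(3, 3)(x).shape)  # identity: shape unchanged
```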
TODO: Complete the implementation of the ResidualBlock's __init__() method in the hw2/cnn.py module.
from hw2.cnn import ResidualBlock
torch.manual_seed(seed)
resblock = ResidualBlock(
in_channels=3, channels=[6, 4]*2, kernel_sizes=[3, 5]*2,
batchnorm=True, dropout=0.2
)
print(resblock)
torch.manual_seed(seed)
test_out = resblock(torch.ones(1, 3, 32, 32))
print(f'{test_out.shape=}')
expected_out = torch.load('tests/assets/expected_resblock_out.pt')
print(f'max_diff={torch.max(torch.abs(test_out - expected_out)).item()}')
test.assertTrue(torch.allclose(test_out, expected_out, atol=1e-3))
ResidualBlock(
(main_path): Sequential(
(0): Conv2d(3, 6, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): Dropout2d(p=0.2, inplace=False)
(2): BatchNorm2d(6, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): ReLU()
(4): Conv2d(6, 4, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
(5): Dropout2d(p=0.2, inplace=False)
(6): BatchNorm2d(4, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(7): ReLU()
(8): Conv2d(4, 6, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(9): Dropout2d(p=0.2, inplace=False)
(10): BatchNorm2d(6, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(11): ReLU()
(12): Conv2d(6, 4, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
)
(shortcut_path): Sequential(
(0): Conv2d(3, 4, kernel_size=(1, 1), stride=(1, 1), bias=False)
)
)
test_out.shape=torch.Size([1, 4, 32, 32])
max_diff=5.960464477539062e-07
Bottleneck Blocks¶
In the ResNet Block diagram shown above, the right block is called a bottleneck block. This type of block is mainly used deep in the network, where the feature space becomes increasingly high-dimensional (i.e. there are many channels).
Instead of applying a KxK conv layer on the original input channels, a bottleneck block first projects to a lower number of features (channels), applies the KxK conv on the result, and then projects back to the original feature space. Both projections are performed with 1x1 convolutions.
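A minimal sketch of this pattern (illustrative; the graded ResidualBottleneckBlock also handles dropout, batchnorm, and configurable activations):

```python
import torch
import torch.nn as nn

# Sketch of the bottleneck main path: 1x1 projection down, KxK conv on the
# reduced channels, 1x1 projection back up. Spatial dims are preserved by
# 'same'-style padding so the result can be summed with the shortcut.
def bottleneck_main_path(channels: int, inner: int, k: int) -> nn.Module:
    return nn.Sequential(
        nn.Conv2d(channels, inner, kernel_size=1),               # e.g. 256 -> 64
        nn.ReLU(),
        nn.Conv2d(inner, inner, kernel_size=k, padding=k // 2),  # KxK on 64 ch
        nn.ReLU(),
        nn.Conv2d(inner, channels, kernel_size=1),               # 64 -> 256
    )

x = torch.randn(1, 256, 8, 8)
y = bottleneck_main_path(256, 64, 3)(x)
print(y.shape)  # same shape as x: channels and spatial dims preserved
```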
TODO: Complete the implementation of the ResidualBottleneckBlock in the hw2/cnn.py module.
from hw2.cnn import ResidualBottleneckBlock
torch.manual_seed(seed)
resblock_bn = ResidualBottleneckBlock(
in_out_channels=256, inner_channels=[64, 32, 64], inner_kernel_sizes=[3, 5, 3],
batchnorm=False, dropout=0.1, activation_type="lrelu"
)
print(resblock_bn)
# Test a forward pass
torch.manual_seed(seed)
test_in = torch.ones(1, 256, 32, 32)
test_out = resblock_bn(test_in)
print(f'{test_out.shape=}')
assert test_out.shape == test_in.shape
expected_out = torch.load('tests/assets/expected_resblock_bn_out.pt')
print(f'max_diff={torch.max(torch.abs(test_out - expected_out)).item()}')
test.assertTrue(torch.allclose(test_out, expected_out, atol=1e-3))
ResidualBottleneckBlock(
(main_path): Sequential(
(0): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1))
(1): Dropout2d(p=0.1, inplace=False)
(2): LeakyReLU(negative_slope=0.01)
(3): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(4): Dropout2d(p=0.1, inplace=False)
(5): LeakyReLU(negative_slope=0.01)
(6): Conv2d(64, 32, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
(7): Dropout2d(p=0.1, inplace=False)
(8): LeakyReLU(negative_slope=0.01)
(9): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(10): Dropout2d(p=0.1, inplace=False)
(11): LeakyReLU(negative_slope=0.01)
(12): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1))
)
(shortcut_path): Sequential(
(0): Identity()
)
)
test_out.shape=torch.Size([1, 256, 32, 32])
max_diff=1.1920928955078125e-07
Now, based on the ResidualBlock, we'll implement our own variation of a residual network (ResNet),
with the following architecture:
[-> (CONV -> ACT)*P -> POOL]*(N/P) -> (FC -> ACT)*M -> FC
\------- SKIP ------/
Note that $N$, $P$ and $M$ are as before, however now $P$ also controls the number of convolutional layers to add a skip-connection to.
TODO: Complete the implementation of the ResNet class in the hw2/cnn.py module.
You must use your ResidualBlocks or ResidualBottleneckBlocks to group together every $P$ convolutional layers.
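One way to picture this grouping (an illustrative sketch only; the graded logic belongs in hw2/cnn.py): chunk the channels list into consecutive groups of at most $P$, and turn each group into one residual block.

```python
# Illustrative only: chunk the channel list into groups of at most P channels;
# each group becomes one residual block in the ResNet's feature extractor.
def chunk_channels(channels, pool_every):
    return [channels[i:i + pool_every]
            for i in range(0, len(channels), pool_every)]

print(chunk_channels([32, 64] * 3, pool_every=4))
# -> [[32, 64, 32, 64], [32, 64]]: one full block of 4 convs, then a partial block
```

This matches the first ResNet test below, where channels=[32, 64]*3 with pool_every=4 produces a 4-conv residual block followed by a 2-conv one.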
from hw2.cnn import ResNet
test_params = [
dict(
in_size=(3,100,100), out_classes=10, channels=[32, 64]*3,
pool_every=4, hidden_dims=[100]*2,
activation_type='lrelu', activation_params=dict(negative_slope=0.01),
pooling_type='avg', pooling_params=dict(kernel_size=2),
batchnorm=True, dropout=0.1,
bottleneck=False
),
dict(
# create 64->16->64 bottlenecks
in_size=(3,100,100), out_classes=5, channels=[64, 16, 64]*4,
pool_every=3, hidden_dims=[64]*1,
activation_type='tanh',
pooling_type='max', pooling_params=dict(kernel_size=2),
batchnorm=True, dropout=0.1,
bottleneck=True
)
]
for i, params in enumerate(test_params):
torch.manual_seed(seed)
net = ResNet(**params)
print(f"\n=== test {i=} ===")
print(net)
torch.manual_seed(seed)
test_out = net(torch.ones(1, 3, 100, 100))
print(f'{test_out=}')
expected_out = torch.load(f'tests/assets/expected_resnet_out_{i:02d}.pt')
print(f'max_diff={torch.max(torch.abs(test_out - expected_out)).item()}')
test.assertTrue(torch.allclose(test_out, expected_out, atol=1e-3))
=== test i=0 ===
ResNet(
(feature_extractor): Sequential(
(0): ResidualBlock(
(main_path): Sequential(
(0): Conv2d(3, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): Dropout2d(p=0.1, inplace=False)
(2): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): LeakyReLU(negative_slope=0.01)
(4): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(5): Dropout2d(p=0.1, inplace=False)
(6): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(7): LeakyReLU(negative_slope=0.01)
(8): Conv2d(64, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(9): Dropout2d(p=0.1, inplace=False)
(10): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(11): LeakyReLU(negative_slope=0.01)
(12): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
)
(shortcut_path): Sequential(
(0): Conv2d(3, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
)
)
(1): AvgPool2d(kernel_size=2, stride=2, padding=0)
(2): ResidualBlock(
(main_path): Sequential(
(0): Conv2d(64, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): Dropout2d(p=0.1, inplace=False)
(2): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): LeakyReLU(negative_slope=0.01)
(4): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
)
(shortcut_path): Sequential(
(0): Identity()
)
)
)
(mlp): MLP(
(layers): Sequential(
(0): Linear(in_features=160000, out_features=100, bias=True)
(1): LeakyReLU(negative_slope=0.01)
(2): Linear(in_features=100, out_features=100, bias=True)
(3): LeakyReLU(negative_slope=0.01)
(4): Linear(in_features=100, out_features=10, bias=True)
(5): Identity()
)
)
)
test_out=tensor([[ 0.0422, 0.0332, 0.1870, -0.0532, -0.0742, 0.1143, -0.0617, -0.0467,
0.0852, 0.0221]], grad_fn=<AddmmBackward0>)
max_diff=1.1920928955078125e-07
=== test i=1 ===
ResNet(
(feature_extractor): Sequential(
(0): ResidualBlock(
(main_path): Sequential(
(0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): Dropout2d(p=0.1, inplace=False)
(2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): Tanh()
(4): Conv2d(64, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(5): Dropout2d(p=0.1, inplace=False)
(6): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(7): Tanh()
(8): Conv2d(16, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
)
(shortcut_path): Sequential(
(0): Conv2d(3, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
)
)
(1): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(2): ResidualBottleneckBlock(
(main_path): Sequential(
(0): Conv2d(64, 16, kernel_size=(1, 1), stride=(1, 1))
(1): Dropout2d(p=0.1, inplace=False)
(2): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): Tanh()
(4): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(5): Dropout2d(p=0.1, inplace=False)
(6): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(7): Tanh()
(8): Conv2d(16, 64, kernel_size=(1, 1), stride=(1, 1))
)
(shortcut_path): Sequential(
(0): Identity()
)
)
(3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(4): ResidualBottleneckBlock(
(main_path): Sequential(
(0): Conv2d(64, 16, kernel_size=(1, 1), stride=(1, 1))
(1): Dropout2d(p=0.1, inplace=False)
(2): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): Tanh()
(4): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(5): Dropout2d(p=0.1, inplace=False)
(6): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(7): Tanh()
(8): Conv2d(16, 64, kernel_size=(1, 1), stride=(1, 1))
)
(shortcut_path): Sequential(
(0): Identity()
)
)
(5): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(6): ResidualBottleneckBlock(
(main_path): Sequential(
(0): Conv2d(64, 16, kernel_size=(1, 1), stride=(1, 1))
(1): Dropout2d(p=0.1, inplace=False)
(2): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): Tanh()
(4): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(5): Dropout2d(p=0.1, inplace=False)
(6): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(7): Tanh()
(8): Conv2d(16, 64, kernel_size=(1, 1), stride=(1, 1))
)
(shortcut_path): Sequential(
(0): Identity()
)
)
(7): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
)
(mlp): MLP(
(layers): Sequential(
(0): Linear(in_features=2304, out_features=64, bias=True)
(1): Tanh()
(2): Linear(in_features=64, out_features=5, bias=True)
(3): Identity()
)
)
)
test_out=tensor([[ 0.0237, -0.1945, -0.0085, -0.4024, -0.2667]],
grad_fn=<AddmmBackward0>)
max_diff=2.3096799850463867e-07
Questions¶
TODO: Answer the following questions. Write your answers in the appropriate variables in the module hw2/answers.py.
from cs236781.answers import display_answer
import hw2.answers
Question 1¶
Consider the bottleneck block from the right side of the ResNet diagram above. Compare it to a regular block that performs two 3x3 convolutions directly on the 256-channel input (i.e. as shown on the left side of the diagram, but with a different number of channels). Explain the differences between the regular block and the bottleneck block in terms of:
- Number of parameters. Calculate the exact numbers for these two examples.
- Number of floating point operations required to compute an output (qualitative assessment).
- Ability to combine the input: (1) spatially (within feature maps); (2) across feature maps.
display_answer(hw2.answers.part4_q1)
Your answer:
1. Number of parameters¶
Regular Block
- First 3x3 Convolution: Parameters = $F*F*C_{in}*C_{out}+ Bias$ = (3 * 3 * 256 * 256) + 256 = 590,080
- Second 3x3 Convolution: Parameters = $F*F*C_{in}*C_{out}+ Bias$ = (3 * 3 * 256 * 256) + 256 = 590,080
- Total Parameters for Regular Block: 590,080 (First Convolution) + 590,080 (Second Convolution) = 1,180,160
Bottleneck Block
- First 1x1 Convolution: Parameters = (1 * 1 * 256 * 64) + 64 = 16,384 + 64 = 16,448
- Second 3x3 Convolution: Parameters = (3 * 3 * 64 * 64) + 64 = 36,864 + 64 = 36,928
- Third 1x1 Convolution: Parameters = (1 * 1 * 64 * 256) + 256 = 16,384 + 256 = 16,640
- Total Parameters for Bottleneck Block: 16,448 (First Convolution) + 36,928 (Second Convolution) + 16,640 (Third Convolution) = 70,016
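These parameter counts can be double-checked with a few lines of arithmetic (an illustrative helper, not part of the graded answer):

```python
# Parameters of one conv layer: F*F*C_in*C_out weights plus C_out biases.
def conv_params(f, c_in, c_out):
    return f * f * c_in * c_out + c_out

regular_params = 2 * conv_params(3, 256, 256)      # two 3x3 convs on 256 ch
bottleneck_params = (conv_params(1, 256, 64)       # 1x1 projection down
                     + conv_params(3, 64, 64)      # inner 3x3 conv
                     + conv_params(1, 64, 256))    # 1x1 projection up
print(regular_params)     # 1180160
print(bottleneck_params)  # 70016
```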
2. Number of floating point operations¶
Regular Block
First 3x3 Convolution:
- Input: A tensor of shape $(C_{in},H,W)$ = (256,H,W)
- Output: A tensor of shape $(C_{out},H,W)$ = (256,H,W). This is true because we don't change the H,W dimensions in residual blocks, to allow the sum with the shortcut at the end.
- FLOPs: For each element in one output feature map we must do $F*F*C_{in} = 3*3*256$ operations (ignoring the bias addition). We have 256 output feature maps and $HW$ elements in each, so in total $F*F*C_{in}*C_{out}*HW = 3*3*256*256*HW = 589824*HW$ operations
Second 3x3 Convolution:
- Input: A tensor of shape $(C_{in},H,W)$ = (256,H,W)
- Output: A tensor of shape $(C_{out},H,W)$ = (256,H,W).
- FLOPs: Applying the same logic as before, we again get $589824*HW$ operations
Total FLOPs for the regular block: $1179648*HW$ operations
Bottleneck Block
First 1x1 Convolution:
- Input: A tensor of shape $(C_{in},H,W)$ = (256,H,W)
- Output: A tensor of shape $(C_{out},H,W)$ = (64,H,W). This is true because we don't change the H,W dimensions
- FLOPs: For each element in one output feature map we must do $F*F*C_{in} = 1*1*256$ operations (ignoring the bias addition). We have 64 output feature maps and $HW$ elements in each, so in total $F*F*C_{in}*C_{out}*HW = 1*1*256*64*HW = 16384*HW$ operations
Second 3x3 Convolution:
- Input: A tensor of shape $(C_{in},H,W)$ = (64,H,W)
- Output: A tensor of shape $(C_{out},H,W)$ = (64,H,W).
- FLOPs: For each element in one output feature map we must do $F*F*C_{in} = 3*3*64$ operations. Therefore we get $3*3*64*64*HW = 36864*HW$ operations
Third 1x1 Convolution:
- Input: A tensor of shape $(C_{in},H,W)$ = (64,H,W)
- Output: A tensor of shape $(C_{out},H,W)$ = (256,H,W).
- FLOPs: For each element in one output feature map we must do $F*F*C_{in} = 1*1*64$ operations (ignoring the bias addition). We have 256 output feature maps and $HW$ elements in each, so in total $F*F*C_{in}*C_{out}*HW = 1*1*64*256*HW = 16384*HW$ operations
Total FLOPs for the bottleneck block: $2*16384*HW + 36864*HW = 69632*HW$ operations
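The per-pixel FLOP totals above can likewise be verified numerically (again an illustrative check, counting only multiply-accumulates and ignoring bias additions):

```python
# FLOPs per output pixel position: F*F*C_in per output element, times the
# C_out output feature maps. Multiply by H*W for the full-tensor count.
def conv_flops_per_pixel(f, c_in, c_out):
    return f * f * c_in * c_out

regular_flops = 2 * conv_flops_per_pixel(3, 256, 256)
bottleneck_flops = (conv_flops_per_pixel(1, 256, 64)
                    + conv_flops_per_pixel(3, 64, 64)
                    + conv_flops_per_pixel(1, 64, 256))
print(regular_flops)     # 1179648 (times H*W)
print(bottleneck_flops)  # 69632 (times H*W)
```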
3. Ability to Combine Input¶
Regular Block
Spatial Combination (Within Feature Maps):
- Each 3x3 convolution can combine information from a 3x3 neighborhood of pixels within each feature map.
- After two 3x3 convolutions, the receptive field is 5x5, meaning each output pixel can "see" a 5x5 area of the input.
- This allows for a relatively larger spatial context to be considered within each feature map.
Feature Map Combination (Across Feature Maps):
- Each 3x3 convolution operates on all 256 channels and produces 256 output channels.
- Thus, we also get a strong ability to mix elements across feature maps.
Bottleneck Block
Spatial Combination (Within Feature Maps):
- The initial 1x1 convolution does not change the spatial context—it only combines information across feature maps.
- The 3x3 convolution then combines spatial information within a 3x3 neighborhood.
- The final 1x1 convolution again does not change the spatial context.
- Overall, the receptive field for the spatial combination in a bottleneck block is 3x3, which is smaller compared to the regular block.
Feature Map Combination (Across Feature Maps):
- First 1x1 Convolution:
- This layer combines information across the feature maps (channels). It reduces the number of channels from 256 to 64. Each output channel in the 1x1 convolution is a combination of all 256 input channels. This allows the network to mix information from all feature maps and learn a compact representation.
- Second 3x3 Convolution:
- This operates on the reduced set of channels and combines information both in the same feature map and across feature maps.
- Third 1x1 Convolution:
- This expands the number of channels back to the original. Again, each output channel in this 1x1 convolution is a combination of all 64 input channels. This allows the network to mix information from the reduced feature maps and expand it back to a richer set of features.
Conclusions:¶
We can see that the regular residual block has a stronger ability to mix elements within feature maps compared to bottleneck blocks, and both have similar ability to combine information across feature maps. However, the bottleneck block has computational advantages in both FLOPs and number of parameters. So, we have to trade off between stronger ability to combine information and computational resources.
$$ \newcommand{\mat}[1]{\boldsymbol {#1}} \newcommand{\mattr}[1]{\boldsymbol {#1}^\top} \newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}} \newcommand{\vec}[1]{\boldsymbol {#1}} \newcommand{\vectr}[1]{\boldsymbol {#1}^\top} \newcommand{\rvar}[1]{\mathrm {#1}} \newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}} \newcommand{\diag}{\mathop{\mathrm {diag}}} \newcommand{\set}[1]{\mathbb {#1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\bb}[1]{\boldsymbol{#1}} $$
Part 5: Convolutional Architecture Experiments¶
In this part we will explore convolution networks and the effects of their architecture on accuracy. We'll use our deep CNN implementation and perform various experiments on it while varying the architecture. Then we'll implement our own custom architecture to see whether we can get high classification results on a large subset of CIFAR-10.
Training will be performed on GPU.
import os
import re
import sys
import glob
import numpy as np
import matplotlib.pyplot as plt
import unittest
import torch
import torchvision
import torchvision.transforms as tvtf
%matplotlib inline
%load_ext autoreload
%autoreload 2
seed = 42
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
plt.rcParams.update({'font.size': 12})
test = unittest.TestCase()
Experimenting with model architectures¶
We will now perform a series of experiments that train various model configurations on a part of the CIFAR-10 dataset.
To perform the experiments, you'll need to use a machine with a GPU since training time might be too long otherwise.
Note about running on GPUs¶
Here's an example of running a forward pass on the GPU (assuming you're running this notebook on a GPU-enabled machine).
from hw2.cnn import ResNet
net = ResNet(
in_size=(3,100,100), out_classes=10, channels=[32, 64]*3,
pool_every=4, hidden_dims=[100]*2,
pooling_type='avg', pooling_params=dict(kernel_size=2),
)
net = net.to(device)
test_image = torch.randint(low=0, high=256, size=(3, 100, 100), dtype=torch.float).unsqueeze(0)
test_image = test_image.to(device)
test_out = net(test_image)
Notice how we called .to(device) on both the model and the input tensor.
Here the device is a torch.device object that we created above. If an nvidia GPU is available on the machine you're running this on, the device will be 'cuda'. When you run .to(device) on a model, it recursively goes over all the model parameter tensors and copies their memory to the GPU. Similarly, calling .to(device) on the input image also copies it.
In order to train on a GPU, you need to make sure to move all your tensors to it. You'll get errors if you try to mix CPU and GPU tensors in a computation.
print(f'This notebook is running with device={device}')
print(f'The model parameter tensors are also on device={next(net.parameters()).device}')
print(f'The test image is also on device={test_image.device}')
print(f'The output is therefore also on device={test_out.device}')
This notebook is running with device=cuda The model parameter tensors are also on device=cuda:0 The test image is also on device=cuda:0 The output is therefore also on device=cuda:0
Notes on using course servers¶
First, please read the course servers guide carefully.
To run the experiments on the course servers, you can use the py-sbatch.sh script directly to perform a single experiment run in batch mode (since it runs python once), or use the srun command to do a single run in interactive mode. For example, running a single run of experiment 1 interactively (after conda activate of course):
srun -c 2 --gres=gpu:1 --pty python -m hw2.experiments run-exp -n test -K 32 64 -L 2 -P 2 -H 100
To perform multiple runs in batch mode with sbatch (e.g. for running all the configurations of an experiment), you can create your own script based on py-sbatch.sh and invoke whatever commands you need within it.
Don't request more than 2 CPU cores and 1 GPU device for your runs. The code won't be able to utilize more than that anyway, so you'll see no performance gain if you do. It will only cause delays for other students using the servers.
General notes for running experiments¶
- You can run the experiments on a different machine (e.g. the course servers) and copy the results (files) to the results folder on your local machine. This notebook will only display the results, not run the actual experiment code (except for a demo run).
- It's important to give each experiment run a name as specified by the notebook instructions later on. Each run has a run_name parameter that will also be the base name of the results file which this notebook will expect to load.
- You will implement the code to run the experiments in the hw2/experiments.py module. This module has a CLI parser so that you can invoke it as a script and pass in all the configuration parameters for a single experiment run.
- You should use python -m hw2.experiments run-exp to run an experiment, and not python hw2/experiments.py run-exp, regardless of how/where you run it.
Experiment 1: Network depth and number of filters¶
In this part we will test some different architecture configurations based on our CNN and ResNet.
Specifically, we want to try different depths and number of features to see the effects these parameters have on the model's performance.
To do this, we'll define two extra hyperparameters for our model, K (filters_per_layer) and L (layers_per_block).
- K is a list containing the number of filters we want in our conv layers.
- L is the number of consecutive layers with the same number of filters to use.
For example, if K=[32, 64] and L=2 it means we want two conv layers with 32 filters followed by two conv layers with 64 filters. If we also use pool_every=3, the feature-extraction part of our model will be:
Conv(X,32)->ReLU->Conv(32,32)->ReLU->Conv(32,64)->ReLU->MaxPool->Conv(64,64)->ReLU
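This expansion can be reproduced with a tiny helper (illustrative code, not part of the assignment modules):

```python
# Illustrative: expand K (filters per layer) and L (repeats) into the conv/
# activation/pool sequence, pooling after every `pool_every` conv layers.
def feature_layers(k, l, pool_every, in_ch='X'):
    channels = [c for c in k for _ in range(l)]
    layers, prev = [], in_ch
    for i, c in enumerate(channels, start=1):
        layers += [f'Conv({prev},{c})', 'ReLU']
        if i % pool_every == 0:
            layers.append('MaxPool')
        prev = c
    return '->'.join(layers)

print(feature_layers([32, 64], l=2, pool_every=3))
# -> Conv(X,32)->ReLU->Conv(32,32)->ReLU->Conv(32,64)->ReLU->MaxPool->Conv(64,64)->ReLU
```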
We'll try various values of the K and L parameters in combination and see how each architecture trains. All other hyperparameters are up to you, including the choice of the optimization algorithm, the learning rate, regularization and architecture hyperparams such as pool_every and hidden_dims. Note that you should select the pool_every parameter wisely per experiment so that you don't end up with zero-width feature maps.
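A quick back-of-the-envelope check illustrates the zero-width pitfall, assuming 'same'-padded convolutions and 2x2 pooling as in the configurations used here:

```python
# Each 2x2 pool halves the spatial size; convs with 'same' padding keep it.
# The final feature-map width must stay positive for the MLP head to work.
def final_width(in_width, n_convs, pool_every, pool_kernel=2):
    n_pools = n_convs // pool_every
    return in_width // (pool_kernel ** n_pools)

print(final_width(32, n_convs=8, pool_every=4))   # 8: fine for 32x32 CIFAR-10
print(final_width(32, n_convs=16, pool_every=2))  # 0: the feature maps vanish
```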
You can try some short manual runs to determine some good values for the hyperparameters or implement cross-validation to do it. However, the dataset size you test on should be large. If you limit the number of batches, make sure to use at least 30000 training images and 5000 validation images.
The important thing is that you state what you used, how you decided on it, and explain your results based on that.
First we need to write some code to run the experiment.
TODO:
- Implement the cnn_experiment() function in the hw2/experiments.py module.
- If you haven't done so already, it would be an excellent idea to implement the early stopping feature of the Trainer class.
The following block tests that your implementation works. It's also meant to show you that each experiment run creates a result file containing the parameters to reproduce and the FitResult object for plotting.
from hw2.experiments import load_experiment, cnn_experiment
from cs236781.plot import plot_fit
# Test experiment1 implementation on a few data samples and with a small model
cnn_experiment(
'test_run', seed=seed, bs_train=50, batches=10, epochs=10, early_stopping=5,
filters_per_layer=[32,64], layers_per_block=1, pool_every=1, hidden_dims=[100],
model_type='resnet',
)
# There should now be a file 'test_run.json' in your `results/` folder.
# We can use it to load the results of the experiment.
cfg, fit_res = load_experiment('results/test_run_L1_K32-64.json')
_, _ = plot_fit(fit_res, train_test_overlay=True)
# And `cfg` contains the exact parameters to reproduce it
print('experiment config: ', cfg)
Files already downloaded and verified
Files already downloaded and verified
--- EPOCH 1/10 ---
train_batch: 0%| | 0/10 [00:00<?, ?it/s]
test_batch: 0%| | 0/10 [00:00<?, ?it/s]
--- EPOCH 2/10 ---
train_batch: 0%| | 0/10 [00:00<?, ?it/s]
test_batch: 0%| | 0/10 [00:00<?, ?it/s]
--- EPOCH 3/10 ---
train_batch: 0%| | 0/10 [00:00<?, ?it/s]
test_batch: 0%| | 0/10 [00:00<?, ?it/s]
--- EPOCH 4/10 ---
train_batch: 0%| | 0/10 [00:00<?, ?it/s]
test_batch: 0%| | 0/10 [00:00<?, ?it/s]
--- EPOCH 5/10 ---
train_batch: 0%| | 0/10 [00:00<?, ?it/s]
test_batch: 0%| | 0/10 [00:00<?, ?it/s]
--- EPOCH 6/10 ---
train_batch: 0%| | 0/10 [00:00<?, ?it/s]
test_batch: 0%| | 0/10 [00:00<?, ?it/s]
--- EPOCH 7/10 ---
train_batch: 0%| | 0/10 [00:00<?, ?it/s]
test_batch: 0%| | 0/10 [00:00<?, ?it/s]
--- EPOCH 8/10 ---
train_batch: 0%| | 0/10 [00:00<?, ?it/s]
test_batch: 0%| | 0/10 [00:00<?, ?it/s]
--- EPOCH 9/10 ---
train_batch: 0%| | 0/10 [00:00<?, ?it/s]
test_batch: 0%| | 0/10 [00:00<?, ?it/s]
--- EPOCH 10/10 ---
train_batch: 0%| | 0/10 [00:00<?, ?it/s]
test_batch: 0%| | 0/10 [00:00<?, ?it/s]
*** Output file ./results/test_run_L1_K32-64.json written
experiment config: {'run_name': 'test_run', 'out_dir': './results', 'seed': 42, 'device': None, 'bs_train': 50, 'bs_test': 12, 'batches': 10, 'epochs': 10, 'early_stopping': 5, 'checkpoints': None, 'lr': 0.001, 'reg': 0.001, 'filters_per_layer': [32, 64], 'pool_every': 1, 'hidden_dims': [100], 'model_type': 'resnet', 'conv_params': {'kernel_size': 3, 'stride': 1, 'padding': 1}, 'activation_type': 'relu', 'activation_params': {}, 'pooling_type': 'max', 'pooling_params': {'kernel_size': 2}, 'batchnorm': True, 'dropout': 0.2, 'bottleneck': False, 'kw': {}, 'layers_per_block': 1}
def plot_exp_results(filename_pattern, results_dir='results'):
fig = None
result_files = glob.glob(os.path.join(results_dir, filename_pattern))
result_files.sort()
if len(result_files) == 0:
print(f'No results found for pattern {filename_pattern}.', file=sys.stderr)
return
for filepath in result_files:
m = re.match(r'exp\d_(\d_)?(.*)\.json', os.path.basename(filepath))
cfg, fit_res = load_experiment(filepath)
fig, axes = plot_fit(fit_res, fig, legend=m[2],log_loss=True)
del cfg['filters_per_layer']
del cfg['layers_per_block']
print('common config: ', cfg)
Experiment 1.1: Varying the network depth (L)¶
First, we'll test the effect of the network depth on training.
Configurations:
- K=32 fixed, with L=2,4,8,16 varying per run
- K=64 fixed, with L=2,4,8,16 varying per run
So 8 different runs in total.
Naming runs:
Each run should be named exp1_1_L{}_K{} where the braces are placeholders for the values. For example, the first run should be named exp1_1_L2_K32.
TODO: Run the experiment on the above configuration with the CNN model. Make sure the result file names are as expected. Use the following blocks to display the results.
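One way to generate all eight run commands is a small helper (hypothetical; the flag names follow the srun example shown earlier, and the -P/-H values are just placeholders for your chosen hyperparameters):

```python
# Hypothetical helper: build the shell command for every exp1_1 configuration.
# Launch each one via srun or a py-sbatch.sh-based script on the course servers.
def exp1_1_commands(pool_every=4, hidden=128):
    cmds = []
    for K in (32, 64):
        for L in (2, 4, 8, 16):
            cmds.append(
                f"python -m hw2.experiments run-exp "
                f"-n exp1_1 -K {K} -L {L} -P {pool_every} -H {hidden}"
            )
    return cmds

for cmd in exp1_1_commands():
    print(cmd)  # 8 commands, one per (K, L) pair
```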
plot_exp_results('exp1_1_L*_K32*.json')
common config: {'run_name': 'exp1_1', 'out_dir': '/home/galkesten/CS236781/Homework2/results', 'seed': 42, 'device': None, 'bs_train': 32, 'bs_test': 32, 'batches': 1500, 'epochs': 30, 'early_stopping': 3, 'checkpoints': None, 'lr': 0.0001, 'reg': 0.0001, 'pool_every': 4, 'hidden_dims': [128], 'model_type': 'cnn', 'conv_params': {'kernel_size': 3, 'stride': 1, 'padding': 1}, 'activation_type': 'relu', 'activation_params': {}, 'pooling_type': 'max', 'pooling_params': {'kernel_size': 2}, 'batchnorm': False, 'dropout': 0.0, 'bottleneck': False, 'kw': {}}
plot_exp_results('exp1_1_L*_K64*.json')
common config: {'run_name': 'exp1_1', 'out_dir': '/home/galkesten/CS236781/Homework2/results', 'seed': 42, 'device': None, 'bs_train': 32, 'bs_test': 32, 'batches': 1500, 'epochs': 30, 'early_stopping': 3, 'checkpoints': None, 'lr': 0.0001, 'reg': 0.0001, 'pool_every': 4, 'hidden_dims': [128], 'model_type': 'cnn', 'conv_params': {'kernel_size': 3, 'stride': 1, 'padding': 1}, 'activation_type': 'relu', 'activation_params': {}, 'pooling_type': 'max', 'pooling_params': {'kernel_size': 2}, 'batchnorm': False, 'dropout': 0.0, 'bottleneck': False, 'kw': {}}
Experiment 1.2: Varying the number of filters per layer (K)¶
Now we'll test the effect of the number of convolutional filters in each layer.
Configurations:
- `L=2` fixed, with `K=[32],[64],[128]` varying per run.
- `L=4` fixed, with `K=[32],[64],[128]` varying per run.
- `L=8` fixed, with `K=[32],[64],[128]` varying per run.
So 9 different runs in total. To clarify, in each run K takes the value of a list with a single element.
Naming runs:
Each run should be named exp1_2_L{}_K{} where the braces are placeholders for the values. For example, the first run should be named exp1_2_L2_K32.
TODO: Run the experiment on the above configuration with the CNN model. Make sure the result file names are as expected. Use the following blocks to display the results.
plot_exp_results('exp1_2_L2*.json')
common config: {'run_name': 'exp1_2', 'out_dir': '/home/galkesten/CS236781/Homework2/results', 'seed': 42, 'device': None, 'bs_train': 32, 'bs_test': 32, 'batches': 1500, 'epochs': 30, 'early_stopping': 3, 'checkpoints': None, 'lr': 0.0001, 'reg': 0.0001, 'pool_every': 4, 'hidden_dims': [128], 'model_type': 'cnn', 'conv_params': {'kernel_size': 3, 'stride': 1, 'padding': 1}, 'activation_type': 'relu', 'activation_params': {}, 'pooling_type': 'max', 'pooling_params': {'kernel_size': 2}, 'batchnorm': False, 'dropout': 0.0, 'bottleneck': False, 'kw': {}}
plot_exp_results('exp1_2_L4*.json')
common config: {'run_name': 'exp1_2', 'out_dir': '/home/galkesten/CS236781/Homework2/results', 'seed': 42, 'device': None, 'bs_train': 32, 'bs_test': 32, 'batches': 1500, 'epochs': 30, 'early_stopping': 3, 'checkpoints': None, 'lr': 0.0001, 'reg': 0.0001, 'pool_every': 4, 'hidden_dims': [128], 'model_type': 'cnn', 'conv_params': {'kernel_size': 3, 'stride': 1, 'padding': 1}, 'activation_type': 'relu', 'activation_params': {}, 'pooling_type': 'max', 'pooling_params': {'kernel_size': 2}, 'batchnorm': False, 'dropout': 0.0, 'bottleneck': False, 'kw': {}}
plot_exp_results('exp1_2_L8*.json')
common config: {'run_name': 'exp1_2', 'out_dir': '/home/galkesten/CS236781/Homework2/results', 'seed': 42, 'device': None, 'bs_train': 32, 'bs_test': 32, 'batches': 1500, 'epochs': 30, 'early_stopping': 3, 'checkpoints': None, 'lr': 0.0001, 'reg': 0.0001, 'pool_every': 4, 'hidden_dims': [128], 'model_type': 'cnn', 'conv_params': {'kernel_size': 3, 'stride': 1, 'padding': 1}, 'activation_type': 'relu', 'activation_params': {}, 'pooling_type': 'max', 'pooling_params': {'kernel_size': 2}, 'batchnorm': False, 'dropout': 0.0, 'bottleneck': False, 'kw': {}}
Experiment 1.3: Varying both the number of filters (K) and network depth (L)¶
Now we'll test the effect of varying both the number of convolutional filters per layer and the network depth.
Configurations:
- `K=[64, 128]` fixed, with `L=2,3,4` varying per run.
So 3 different runs in total. To clarify, in each run K takes the value of a list with two elements.
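As an illustrative sketch, a (K, L) pair can be expanded into the per-layer channel list. This is our assumption about the expansion, consistent with the configurations analyzed in the answers below (e.g. L=2 with K=[64, 128] giving [64, 64, 128, 128]):

```python
# Sketch: expand (K, L) into a per-layer channel list.
# Assumption: each filter count in K is repeated for L consecutive layers.
def expand_channels(K, L):
    return [k for k in K for _ in range(L)]

print(expand_channels([64, 128], 2))  # → [64, 64, 128, 128]
```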
Naming runs:
Each run should be named exp1_3_L{}_K{}-{} where the braces are placeholders for the values. For example, the first run should be named exp1_3_L2_K64-128.
TODO: Run the experiment on the above configuration with the CNN model. Make sure the result file names are as expected. Use the following blocks to display the results.
plot_exp_results('exp1_3*.json')
common config: {'run_name': 'exp1_3', 'out_dir': '/home/galkesten/CS236781/Homework2/results', 'seed': 42, 'device': None, 'bs_train': 32, 'bs_test': 32, 'batches': 1500, 'epochs': 30, 'early_stopping': 3, 'checkpoints': None, 'lr': 0.0001, 'reg': 0.0001, 'pool_every': 4, 'hidden_dims': [128], 'model_type': 'cnn', 'conv_params': {'kernel_size': 3, 'stride': 1, 'padding': 1}, 'activation_type': 'relu', 'activation_params': {}, 'pooling_type': 'max', 'pooling_params': {'kernel_size': 2}, 'batchnorm': False, 'dropout': 0.0, 'bottleneck': False, 'kw': {}}
Experiment 1.4: Adding depth with Residual Networks¶
Now we'll test the effect of skip connections on the training and performance.
Configurations:
- `K=[32]` fixed, with `L=8,16,32` varying per run.
- `K=[64, 128, 256]` fixed, with `L=2,4,8` varying per run.
So 6 different runs in total.
Naming runs:
Each run should be named exp1_4_L{}_K{}-{}-{} where the braces are placeholders for the values.
TODO: Run the experiment on the above configuration with the ResNet model. Make sure the result file names are as expected. Use the following blocks to display the results.
plot_exp_results('exp1_4_L*_K32.json')
common config: {'run_name': 'exp1_4', 'out_dir': '/home/galkesten/CS236781/Homework2/results', 'seed': 42, 'device': None, 'bs_train': 32, 'bs_test': 32, 'batches': 1500, 'epochs': 60, 'early_stopping': 3, 'checkpoints': None, 'lr': 0.0001, 'reg': 0.0001, 'pool_every': 8, 'hidden_dims': [512, 256], 'model_type': 'resnet', 'conv_params': {'kernel_size': 3, 'stride': 1, 'padding': 1}, 'activation_type': 'relu', 'activation_params': {}, 'pooling_type': 'max', 'pooling_params': {'kernel_size': 2}, 'batchnorm': True, 'dropout': 0.2, 'bottleneck': False, 'kw': {}}
plot_exp_results('exp1_4_L*_K64*.json')
common config: {'run_name': 'exp1_4', 'out_dir': '/home/galkesten/CS236781/Homework2/results', 'seed': 42, 'device': None, 'bs_train': 32, 'bs_test': 32, 'batches': 1500, 'epochs': 60, 'early_stopping': 3, 'checkpoints': None, 'lr': 0.0001, 'reg': 0.0001, 'pool_every': 8, 'hidden_dims': [512, 256], 'model_type': 'resnet', 'conv_params': {'kernel_size': 3, 'stride': 1, 'padding': 1}, 'activation_type': 'relu', 'activation_params': {}, 'pooling_type': 'max', 'pooling_params': {'kernel_size': 2}, 'batchnorm': True, 'dropout': 0.2, 'bottleneck': False, 'kw': {}}
Questions¶
TODO Answer the following questions. Write your answers in the appropriate variables in the module hw2/answers.py.
from cs236781.answers import display_answer
import hw2.answers
Question 1¶
Analyze your results from experiment 1.1. In particular,
- Explain the effect of depth on the accuracy. What depth produces the best results and why do you think that's the case?
- Were there values of
Lfor which the network wasn't trainable? what causes this? Suggest two things which may be done to resolve it at least partially.
display_answer(hw2.answers.part5_q1)
Your answer: In this experiment, we used the Adam optimizer, max pooling every 4 layers, a learning rate of 0.0001, regularization of 0.0001, a batch size of 32, and early stopping with patience 3. These are also the hyperparameters chosen for experiments 1.2 and 1.3. The results of Experiment 1.1 show that the depth of the network impacts accuracy. The L16 configurations (16 layers per block) with both 32 and 64 filters per layer were non-trainable, with test accuracy stagnant at 10%. This is likely due to vanishing gradients, which prevent the network from learning effectively. The L8 configurations (8 layers per block) performed better, with L8_K32 reaching approximately 60.18% test accuracy and L8_K64 around 61.47%. Although these results were better than L16, they were not the best observed.
The L4 configurations produced the best results. The L4_K32 configuration achieved a test accuracy of about 63.99%, while L4_K64 reached around 65.51%, outperforming both the L2 and L8 configurations. The L2 configurations also performed well: L2_K32 achieved about 63.68% test accuracy, and L2_K64 around 63.08%. We also believe that max pooling on the fourth layer helped enhance the results of L4 relative to L2.
In choosing hyperparameters, we focused on ensuring convergence and avoiding vanishing gradients. We manually tuned various hyperparameters, including learning rate and regularization. A very small learning rate (0.0001) was necessary for the L8 configurations to ensure stable learning and convergence. We also had to use small regularization values, as larger ones would encourage smaller weights and exacerbate the vanishing gradient problem. This led to observable overfitting, as seen in the training and testing loss graphs, but was necessary to prevent the gradients from vanishing entirely and to allow the deeper network to learn. Despite the extensive tuning, the highest accuracy achieved was around 65%, indicating room for improvement.
To address the vanishing gradients problem in very deep networks, we can use batch normalization, which stabilizes the learning process by normalizing the inputs of each layer. Additionally, adding residual connections (as in ResNet architectures) can provide shortcut paths for gradients, allowing for more effective training of deep networks. These strategies can mitigate the problems associated with training very deep networks and improve their performance.
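As a minimal sketch (not the hw2 implementation), the two remedies suggested above, batch normalization plus a residual skip connection, might look like this in PyTorch:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Sketch of a residual block: two convs with batchnorm, plus a skip."""
    def __init__(self, channels):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        # The identity path gives gradients a shortcut around the convolutions.
        return self.relu(self.main(x) + x)

block = ResBlock(32)
out = block(torch.randn(4, 32, 8, 8))  # shape is preserved
```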
Question 2¶
Analyze your results from experiment 1.2. In particular, compare to the results of experiment 1.1.
display_answer(hw2.answers.part5_q2)
The results of Experiment 1.2 provide further insights into the effect of varying the number of filters per layer (K)
in combination with different network depths (L).
For the L2 configurations, we observed a slight improvement in test accuracy as K increased. Despite using early stopping, the L2_K128 configuration converged too quickly, necessitating stopping after fewer than 10 epochs. We were unable to use a learning-rate schedule to control convergence effectively.
This suggests that for shallow networks, a larger number of feature maps can improve accuracy, provided overfitting is controlled.
The L4 configurations showed similar trends to L2, with K=128 leading to overfitting
and K=64 yielding the best performance. For the L8 configurations, we observed similar trends, but with worse performance than the L4 and L2 configurations.
Comparing these results to Experiment 1.1, we again see that the performance for L8 is less favorable than the
other depths, likely due to the vanishing gradients problem. The L4 configuration with K=64 achieved the best
performance, consistent with previous findings.
Across both experiments, overfitting remains a significant issue that we struggled to control, caused by the need to choose hyperparameters that allow the L8 configuration to converge.
Question 3¶
Analyze your results from experiment 1.3.
display_answer(hw2.answers.part5_q3)
Your answer: Experiment 1.3 explored varying both the number of filters and the network depth. The L2 configuration with layers [64, 64, 128, 128] achieved the highest test accuracy, approximately 65.54%. The L3 configuration with layers [64, 64, 64, 128, 128, 128] showed a lower peak test accuracy (64.01%); the vanishing gradients phenomenon probably also affects this 6-layer network. The L4 configuration with layers [64, 64, 64, 64, 128, 128, 128, 128] exhibited the most significant overfitting, achieving a peak accuracy of around 64.25% before declining.
Again, we see that the 8-layer configuration suffers a decrease in performance, despite our expectation that it would learn better thanks to its ability to learn hierarchical features. Additionally, we do not see a significant improvement from the 64-128 configuration compared to four layers of 64 filters as before. We also observe that incorporating 128 filters led to quicker overfitting, necessitating early stopping to prevent excessive epochs, consistent with the behavior observed previously.
Question 4¶
Analyze your results from experiment 1.4. Compare to experiment 1.1 and 1.3.
display_answer(hw2.answers.part5_q4)
In Experiment 1.4, we explored the impact of skip connections (Residual Networks) on training and performance. We tuned the hyperparameters manually, using the same learning rate, weight decay, and Adam optimizer as before. However, we used max pooling every 8 layers in these experiments to allow effective learning in deeper networks (we had to choose the same hyperparameters for all network configurations). Additionally, dropout of 0.2 and batch normalization were incorporated to help manage overfitting.
The L8_K32 configuration reached a test accuracy of approximately 69.19%. The model converged well but began to overfit after 10 epochs, as indicated by the rising test loss. The training accuracy continued to increase steadily, suggesting that the model was learning effectively but entered overfitting due to the small filter size and increased depth.
The L16_K32 configuration achieved a higher test accuracy of around 72.99%,
indicating that adding more layers improved the model's capacity to learn complex features.
However, overfitting was still a concern as the test loss increased after about 12 epochs.
The training accuracy was high, with early stopping helping to prevent excessive overfitting.
The L32_K32 configuration achieved the highest test accuracy in this set, reaching around 74.82%.
Despite the increased depth, the model benefitted from skip connections,
which mitigated some of the vanishing gradient issues. The training loss decreased consistently,
and the accuracy improved steadily, indicating effective learning.
We were also able to train the network for a longer time before early
stopping compared to other configurations in this experiment.
In the K=[64, 128, 256] configurations, the L2_K64-128-256 configuration achieved a test accuracy of
approximately 66.56%. The model learned quickly but showed significant overfitting
due to the large number of filters and the shallow depth (6 layers).
The training accuracy increased rapidly, suggesting the model had enough capacity to learn complex patterns,
but the generalization was poor. The L4_K64-128-256 configuration performed better, achieving
a test accuracy of around 71.91%. The increased depth helped improve generalization,
though overfitting remained a challenge. The model showed a steady increase in training accuracy,
with early stopping helping to control overfitting. The L8_K64-128-256 configuration achieved the highest
test accuracy in this set, reaching around 75.68%. The deeper network benefitted significantly from the residual
connections, which helped mitigate vanishing gradients and improved learning.
The model showed the best training performance, with high training accuracy and a more controlled overfitting pattern.
Overall, we observed several key points from both sets of experiments:
- Skip Connections: The skip connections allowed us to train deeper networks compared to previous experiments. We were able to train networks with depths of 8, 16, and 32 layers without encountering vanishing gradient problems. Compared to previous experiments without skip connections, the use of residual networks significantly improved the performance of deeper models.
- Depth: Deeper networks performed best in both sets of experiments. This is likely due to their ability to learn hierarchical features, combined with max pooling every 8 layers, which gradually increased the receptive field. The deeper networks also generalized better compared to the shallow networks and networks from previous experiments.
- Increased Filter Counts: The larger numbers of filters (the `K=64-128-256` configuration) helped achieve slightly better results compared to the `L32_K32` configuration. However, this configuration is suitable only for deeper networks, as observed: in shallow networks, we reached overfitting quite quickly, similar to the behavior in Experiment 1.3.
- Hyperparameter Tuning: We believe that with appropriate hyperparameter tuning for the deeper networks, we could achieve even better results.
$$ \newcommand{\mat}[1]{\boldsymbol {#1}} \newcommand{\mattr}[1]{\boldsymbol {#1}^\top} \newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}} \newcommand{\vec}[1]{\boldsymbol {#1}} \newcommand{\vectr}[1]{\boldsymbol {#1}^\top} \newcommand{\rvar}[1]{\mathrm {#1}} \newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}} \newcommand{\diag}{\mathop{\mathrm {diag}}} \newcommand{\set}[1]{\mathbb {#1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\bb}[1]{\boldsymbol{#1}} $$
Part 6: YOLO - Objects Detection¶
In this part we will use an object detection architecture called YOLO (You Only Look Once) to detect objects in images. We'll use pre-trained model weights (v5) found here: https://github.com/ultralytics/yolov5
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Load the YOLO model
model = torch.hub.load("ultralytics/yolov5", "yolov5s")
model.to(device)
# Images
img1 = 'imgs/DolphinsInTheSky.jpg'
img2 = 'imgs/cat-shiba-inu-2.jpg'
Using cache found in /home/galkesten/.cache/torch/hub/ultralytics_yolov5_master
YOLOv5 🚀 2024-7-22 Python-3.8.12 torch-1.10.1 CUDA:0 (NVIDIA TITAN Xp, 12190MiB)
Fusing layers...
YOLOv5s summary: 213 layers, 7225885 parameters, 0 gradients, 16.4 GFLOPs
Adding AutoShape...
Inference with YOLO¶
You are provided with 2 images (img1 and img2). TODO:
- Detect objects using the YOLOv5 model for these 2 images.
- Print the inference output with bounding boxes.
- Look at the inference results and answer the question below.
%matplotlib inline
import torch
import cv2
import numpy as np
from matplotlib import pyplot as plt
with torch.no_grad():
imgs = [img1, img2]
results = model(imgs)
dfs = results.pandas().xyxy
for i, df in enumerate(dfs):
img = cv2.cvtColor(cv2.imread(imgs[i]), cv2.COLOR_BGR2RGB)
img_height, img_width, _ = img.shape
plt.figure(figsize=(10, 10))
plt.imshow(img)
# Draw bounding boxes and labels
for _, row in df.iterrows():
xmin = int(row['xmin'])
ymin = int(row['ymin'])
xmax = int(row['xmax'])
ymax = int(row['ymax'])
label = f"{row['name']}: {row['confidence']:.2f}"
# Draw bounding box
rect = plt.Rectangle((xmin, ymin), xmax - xmin, ymax - ymin, fill=False, color='red', linewidth=2)
plt.gca().add_patch(rect)
plt.text(xmin, ymin - 10, label, color='red', fontsize=8, bbox=dict(facecolor='white', alpha=0.8))
plt.axis('off')
plt.show()
from cs236781.answers import display_answer
import hw2.answers
Question 1¶
Analyze the inference results of the 2 images.
- How well did the model detect the objects in the pictures? With what confidence?
- What could be the reason for the model's failures? Suggest methods to resolve that issue.
- Recall that we learned how to fool a model with an adversarial attack (PGD). Describe how you would attack an object detection model (such as YOLO).
display_answer(hw2.answers.part6_q1)
Your answer: The model exhibits poor performance in detection and also struggles with some localization issues.
1.1
Detailed explanation:
Image 1
- Localization: There are actually three dolphins in the image. Two of them are close together, which might have confused the model, resulting in a single bounding box covering both dolphins. The third dolphin's tail is detected as a separate object.
- Detection
- The detection performance is poor. The model incorrectly identified:
- Two dolphins as "person" with confidence scores of $0.53$ and $0.90$.
- The tail of a dolphin as a "surfboard" with a confidence score of $0.37$.
- The model is quite confident ($90\%$) about one of its incorrect predictions, indicating a significant detection error.
Image 2
Localization
The model localized three objects (the dogs) but failed to detect the cat.
Detection
The model made the following predictions:
- Two dogs were labeled as "cat" with confidence scores of $0.65$ and $0.39$.
- The third dog was labeled correctly but with a low confidence score of $0.50$.
The model's predictions show confusion between cats and dogs, indicating issues with the classification performance.
1.2 Failure Reasons for the First Image: The model fails because "dolphin" is not a class in the YOLOv5 model's training data, causing it to mislabel the dolphins as other classes. Regarding the mislocalization of the two dolphins, it may be related to the black shadow that merges the objects, making it difficult for the model to distinguish between them. Moreover, it's likely that the model hasn't been trained on enough dolphin images, especially ones taken in similar lighting conditions. Also, there might be a bias in the training data towards pictures of people surfing at sunset, leading the model to wrongly label dolphins as people and surfboards.
Failure Reasons for the Second Image In the second image, even though the resolution is better, the model still struggles with classification. It confuses dogs for cats, probably because of the dogs' cat-like ear shapes and poses. This suggests that the model hasn't seen enough examples of these variations in the training data. The unusual poses and features of the dogs might be confusing the model, causing these incorrect classifications. Additionally, the model completely misses detecting one of the cats. This could be due to how anchor boxes are used in YOLOv5. YOLOv5 uses anchor boxes to predict bounding boxes around objects. Anchor boxes are pre-defined boxes with specific heights and widths that the model uses as a reference. If these predefined anchor boxes don't match the sizes and shapes of the objects in the image well, the model might have trouble detecting them. This mismatch can lead to missed objects or inaccurate localization. By optimizing these anchor boxes to better fit the objects in the training data, the model's performance can be improved.
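To illustrate the anchor-box mismatch argument, here is a minimal IoU (intersection over union) computation between two boxes. This is a standalone sketch, not YOLOv5's actual anchor-matching code:

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A square anchor matches a square object perfectly...
print(iou((0, 0, 10, 10), (0, 0, 10, 10)))  # → 1.0
# ...but overlaps poorly with a long, thin object of the same area,
# which is how a shape mismatch can lead to missed detections.
print(iou((0, 0, 10, 10), (0, 0, 50, 2)))
```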
**Solutions:** To address these issues, we need to improve the training dataset with a wider variety of images. This means adding more pictures of dolphins in different lighting conditions and dogs with various ear shapes and poses. Using data augmentation techniques can also help make the training data more diverse. For the first image, increasing the contrast between the two dolphins can help the model distinguish them better. Additionally, optimizing anchor boxes by analyzing the dataset can ensure they better match the sizes and shapes of the objects in the images. This involves checking the training data to find the most common object sizes and shapes and adjusting the anchor boxes accordingly. Refining how the bounding boxes are set up can also help the model detect and localize objects more accurately. Using more bounding boxes per grid cell can also improve the detection of objects with varying sizes and shapes, further enhancing the model's performance. Finally, we should train the model on more classes, such as dolphin.
1.3 Using Projected Gradient Descent (PGD), we can generate adversarial examples that target YOLO's classification, localization, and confidence predictions. The process involves adding small perturbations to the input image, calculating the gradient of the targeted loss function (classification, localization, confidence losses, or a combination), and iteratively updating the perturbation to maximize the loss. By projecting the perturbed image back into the valid input space to keep changes realistic, we create adversarial images that cause the model to misclassify objects, alter bounding box coordinates, miss objects entirely, or detect nonexistent objects. This method disrupts the model's performance while keeping the adversarial changes imperceptible to human observers.
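The iterative procedure described above can be sketched as follows. Here `model` and `loss_fn` are generic placeholders; attacking YOLO would plug in its detection losses (classification, localization, confidence) in place of the toy loss:

```python
import torch

def pgd_attack(model, loss_fn, x, eps=8/255, alpha=2/255, steps=10):
    """Generic PGD sketch: maximize loss within an eps-ball around x."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv))
        grad, = torch.autograd.grad(loss, x_adv)
        # Step in the direction that increases the loss...
        x_adv = x_adv.detach() + alpha * grad.sign()
        # ...then project back into the eps-ball and the valid pixel range.
        x_adv = x + (x_adv - x).clamp(-eps, eps)
        x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv.detach()

# Toy usage: identity "model" and sum "loss"; the perturbation stays
# within eps and the pixels stay in [0, 1].
x = torch.full((1, 3, 4, 4), 0.5)
x_adv = pgd_attack(lambda t: t, lambda y: y.sum(), x)
```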
Creative Detection Failures¶
Object detection pitfalls include, for example: occlusion - when objects are partially occluded and thus missing important features; model bias - when a model learns some bias about an object, it may recognize it as something else in a different setup; and many others, such as deformation, illumination conditions, cluttered or textured backgrounds, and blurring due to moving objects.
TODO: Take pictures that demonstrate 3 of the above object detection pitfalls, run inference, and analyze the results.
#Insert the inference code here.
%matplotlib inline
import torch
import cv2
import numpy as np
from matplotlib import pyplot as plt
with torch.no_grad():
imgs = ['imgs/background.jpeg', 'imgs/partial.jpeg', 'imgs/dark.jpeg']
results = model(imgs)
dfs = results.pandas().xyxy
for i, df in enumerate(dfs):
img = cv2.cvtColor(cv2.imread(imgs[i]), cv2.COLOR_BGR2RGB)
img_height, img_width, _ = img.shape
plt.figure(figsize=(10, 10))
plt.imshow(img)
# Draw bounding boxes and labels
for _, row in df.iterrows():
xmin = int(row['xmin'])
ymin = int(row['ymin'])
xmax = int(row['xmax'])
ymax = int(row['ymax'])
label = f"{row['name']}: {row['confidence']:.2f}"
# Draw bounding box
rect = plt.Rectangle((xmin, ymin), xmax - xmin, ymax - ymin, fill=False, color='red', linewidth=2)
plt.gca().add_patch(rect)
plt.text(xmin, ymin - 10, label, color='red', fontsize=8, bbox=dict(facecolor='white', alpha=0.8))
plt.axis('off')
plt.show()
Question 3¶
Analyze the results of the inference.
- How well did the model detect the objects in the pictures? Explain.
display_answer(hw2.answers.part6_q3)
Your answer:
Picture 1: Cluttered Background
Description: A cluttered bookshelf with many overlapping dolls and books.
Inference Results: The detector identified 18 books and incorrectly identified a person’s hand as a person. It failed to detect any of the dolls, likely due to the high number of objects and clutter. Even among the books, some were missed, and the model's confidence in its detections was low.
Picture 2: Partial Occlusion and Model Bias
Description: A book is photographed at an angle, with parts of it excluded.
Inference Results: The model misclassified the book as a laptop, indicating a bias. The model seems to associate certain angles with laptops and expects books to be in specific orientations and fully visible to classify them correctly. In a previous example, books in a library setting were correctly identified due to the context and orientation.
Picture 3: Illumination Conditions and Partial Occlusion
Description: The image is taken in a dark room with poor lighting.
Inference Results: The detector failed to identify the table due to the poor lighting, which made it difficult to distinguish objects. It misclassified the laundry hanger as a chair, which is reasonable given that "laundry hanger" is not one of the model's classes. The couch was also poorly localized due to the lighting conditions.
Bonus¶
Try improving the model performance over poorly recognized images by changing them. Describe the manipulations you did to the pictures.
#insert bonus code here
%matplotlib inline
import torch
import cv2
import numpy as np
from matplotlib import pyplot as plt
with torch.no_grad():
imgs = ['imgs/partialRotated.jpeg']
results = model(imgs)
dfs = results.pandas().xyxy
for i, df in enumerate(dfs):
img = cv2.cvtColor(cv2.imread(imgs[i]), cv2.COLOR_BGR2RGB)
img_height, img_width, _ = img.shape
plt.figure(figsize=(10, 10))
plt.imshow(img)
# Draw bounding boxes and labels
for _, row in df.iterrows():
xmin = int(row['xmin'])
ymin = int(row['ymin'])
xmax = int(row['xmax'])
ymax = int(row['ymax'])
label = f"{row['name']}: {row['confidence']:.2f}"
# Draw bounding box
rect = plt.Rectangle((xmin, ymin), xmax - xmin, ymax - ymin, fill=False, color='red', linewidth=2)
plt.gca().add_patch(rect)
plt.text(xmin, ymin - 10, label, color='red', fontsize=8, bbox=dict(facecolor='white', alpha=0.8))
plt.axis('off')
plt.show()
display_answer(hw2.answers.part6_bonus)
Your answer: For the partial photo, we rotated the image 180 degrees and received a correct prediction (book) with low confidence. This strengthens our observation that the model has a bias regarding the angle of the photo when distinguishing between a computer and a book.
For the other photos, we couldn't resolve the issues.
Cluttered Background: The model fails to recognize the dolls as objects. Even when we photographed a single doll, the model did not recognize it.
Dark Photo: We attempted to lighten the background but were unsuccessful in both recognizing the table and maintaining accurate recognition of the couch. Additionally, the model kept misclassifying the table as a chair or oven. We believe that lightening the photo introduced a lot of noise, further confusing the model.
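The lightening manipulation we attempted can be sketched as a simple gamma correction via a lookup table. This is a standalone NumPy sketch; the exact edit we applied to the photo may differ:

```python
import numpy as np

def adjust_gamma(img, gamma=0.5):
    """Gamma-correct a uint8 image; gamma < 1 brightens it."""
    lut = (255.0 * (np.arange(256) / 255.0) ** gamma).astype(np.uint8)
    return lut[img]  # per-pixel lookup preserves shape and dtype

dark = np.full((2, 2, 3), 40, dtype=np.uint8)
bright = adjust_gamma(dark, gamma=0.5)  # pixel values increase
```

Note that brightening like this amplifies sensor noise along with the signal, which is consistent with the confusion we observed after lightening the photo.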